Table of Contents

In [4]:
import Bio
import json
import locale
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import platform
import shutil
import sys
import webbrowser
# Necessary to set font family for Latex in matplotlib:
from matplotlib import rc

locale.setlocale(locale.LC_ALL, '')
o_s = platform.system()
paths = {}

# to make this notebook's output stable across runs
np.random.seed(42)

if o_s == 'Darwin':
    paths['data_dir'] = '/Users/drew/data/Bio'
    paths['sra_tools_dir'] = '/Users/drew/Documents/Data/Bio/\
sratoolkit/bin'
    paths['fastqc'] = '/Applications/FastQC.app/Contents/MacOS/fastqc'
    paths['bbmap_dir'] = '/Users/drew/Documents/Data/Python/bbmap'
    
elif o_s == 'Windows':
    paths['data_dir'] = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap'
    paths['sra_tools_dir'] = r'C:\Users\DMacKellar\Documents\\
Python\BioPython\Galaxy_rnaseq\sratoolkit\sratoolkit.2.8.2-1-win64\bin'
    paths['fastqc_dir'] = r'C:\Users\DMacKellar\Documents\Python\BioPython\FastQC'
    paths['multiqc_dir'] = r'C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC'
    paths['jdk_dir'] = r'C:\Program Files\Java\jdk1.8.0_101'
    paths['bzip_dir'] = r'C:\Users\DMacKellar\Documents\Python\BioPython\Glimmer\Glimmer-master\src\main\java'
    paths['bbmap_dir'] = r'C:\Users\DMacKellar\Documents\Python\BioPython\BBMap'
    paths['phi_x'] = r'C:/Users/DMacKellar/Documents/Python/BioPython/BBMap/resources/phix174_ill.ref.fa'
    paths['grch38_dir'] = os.path.join(paths['data_dir'], 'grch38')

# rc('font',**{'family':'serif','serif':['DejaVu Sans']})
rc('text', usetex=False)

for root, dirs, files in os.walk(paths['data_dir']):
    for file in files:
        file_under = str(file).replace('.', '_')
        path = os.path.join(root, file)
        paths[file_under] = path
        
sra_table = pd.read_csv(paths['Bmap_SraRunTable_txt'], sep='\t')
paths['credentials_json'] = os.path.join(paths['data_dir'], 'credentials.json')
with open(paths['credentials_json'], 'r') as f:
    credentials = json.load(f)

Additional Inspiration

Some notebooks and other Python code can be found online that offer inspiration/tips/demonstrations in bioinformatics:

RNAseq Alignment Exercise

The point of this notebook is to brush up on some basic Bioinformatics data-wrangling tasks. I've sequenced a novel bacterial genome before, but a more common application relevant to future job searches is aligning RNAseq data, and thereby deriving tissue-specific gene expression patterns.

I got the initial data set from here. I downloaded the data files to:

C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq

on PC and:

/Users/drew/Documents/Data/Python/Galaxy_rnaseq

on Mac.

I'll need fastqc, bowtie, cufflinks, tophat, and stuff. The installation path for Ubuntu on Windows is:

C:\Users\DMacKellar\AppData\Local\Lxss\rootfs

Just the dependencies for fastqc is >300MB, so I'll want to remember this location and possibly uninstall the tools when I'm done with them.

Docs for fastqc are here.

The raw reads for the Illumina BodyMap 2.0 project are here. There are 48 files, each with somewhere around 8E7 spots (potentially representing that many transcripts), for about 6-8E9 Bp per file. A per-run listing is here.

General Points

RNA-seq alignment and quantification involves specific assumptions about methodology that are distinct from those of processing DNA reads for genome sequencing.

General info about the Illumina Body Map 2 is available from several sources. One note on methodology from this talk:

  • The samples used for the 2X50 and 1X75 bp runs are prepared using the Illumina mRNA-Seq kit.
    • These libraries are made from poly-A selected mRNA
    • They are made with a random priming process and are not stranded (i.e., can represent either F or R direction)
    • The insert size for the PE libraries is about ~210 bps

16 Human tissues are represented in the project, and for each tissue they ran one whole lane of a HiSeq2000 sequencer, mixing the paired-end and 1x75 sequences.

As a side project, they pooled total RNA from each of the 16 tissues, then subjected the pooled RNAs to each of three different treatments:

  • Total Poly-A selected mRNA
  • Total Poly-A selected mRNA with Normalization
  • Total RNA – no Poly-A Selection – to enrich for non-coding RNA
    • Complete Transcriptome Library Prep Method to capture all RNA species
    • Uses New Illumina protocol for reducing rRNA in whole transcriptome analysis

Apparently, the point of this latter approach was to validate a new protocol they had developed that was meant to enrich non-rRNA in a sample. From each of these pooled runs they collected 1x100bp reads per sequence.


Quality Control

Assessing sequence and base quality and trimming accordingly can be done with a combination of FastQC and Trimmomatic, or via BioPython.

We can get a preliminary look at the fastq format by simply opening the Bash shell and typing

less [filename.fastq]

In the case of the example files downloaded from the site listed above, the first three reads in 'Galaxy2-[adrenal_1.fastq\].fastqsanger' are:

@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/1
ATCTTTTGTGGCTACAGTAAGTTCAATCTGAAGTCAAAACCAACCAATTT
+
5.544,444344555CC?CAEF@EEFFFFFFFFFFFFFFFFFEFFFEFFF
@ERR030881.311 HWI-BRUNOP16X_0001:2:1:18330:1130#0/1
TCCATACATAGGCCTCGGGGTGGGGGAGTCAGAAGCCCCCAGACCCTGTG
+
GFFFGFFBFCHHHHHHHHHHIHEEE@@@=GHGHHHHHHHHHHHHHHHHHH
@ERR030881.1487 HWI-BRUNOP16X_0001:2:1:4144:1420#0/1
GTATAACGCTAGACACAGCGGAGCTCGGGATTGGCTAAACTCCCATAGTA
+
55*'+&&5'55('''888:8FFFFFFFFFF4/1;/4./++FFFFF=5:E#

As can be seen, the reads are all cheek-to-jowl, with no lines separating them. The eye is drawn to the whitespace of the lines that contain only '+', but that line doesn't come between different reads, but rather separates the lines containing base pair calls and their respective quality scores. In fact, the line order goes:

  • Unique Read Identifier; details on interpretation available from Wikipedia. The first few characters that are in common among all reads in the file are usually either designate the specific machine on which they were sequenced or, as here, are the string under which the data were registered on the NCBI Sequence Read Archive. Specifically, in this case, the page is here.

  • Base Calls; I believe the only valid values are ['A', 'C', 'G', 'T', and 'N'].

  • A spacer line containing '+'. Despite appearance, it doesn't necessarily denote any kind of assumption about which strand of DNA in the chromosome from which the read originates; they're all supposed to be '+'.

  • Quality scores; base quality scores can be interpreted with the info here. The encoding apparently differs between platform, however, so attention must be paid to what machine/technology generated the reads.

After those four lines, the next read comes immediately, with no separating line.


FastQC

I intend to try BioPython's SeqIO module later, but I am more familiar with FastQC, and will use it first to analyze these example data. FastQC can be run in a GUI mode, in an independent window. Opening a '.fastq' file within that automatically reads all sequences within and offers several tabs of summary data, and prints an icon beside the title of each tab that summarizes whether or not FastQC considers the output problematic for future integration into a mapping/assembly pipeline.

In [180]:
sra_table = pd.read_csv(paths['Bmap_SraRunTable_txt'], sep='\t')

sra_table.head()
Out[180]:
AvgSpotLen BioSample BioSourceProvider Experiment InsertSize LibraryLayout Library_Name MBases MBytes Run ... DATASTORE_filetype DATASTORE_provider Instrument LibrarySelection LibrarySource LoadDate Organism Platform ReleaseDate SRA_Study
0 100 SAMEA962337 NaN ERX011226 0 SINGLE HCT20170 7290 4098 ERR030856 ... sra ncbi Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC 2014-05-30 Homo sapiens ILLUMINA 2011-03-17 ERP000546
1 100 SAMEA962337 NaN ERX011216 0 SINGLE HCT20170 7461 4190 ERR030857 ... sra ncbi Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC 2014-05-30 Homo sapiens ILLUMINA 2011-03-17 ERP000546
2 100 SAMEA962337 NaN ERX011224 0 SINGLE HCT20170 7365 4151 ERR030858 ... sra ncbi Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC 2014-05-30 Homo sapiens ILLUMINA 2011-03-17 ERP000546
3 100 SAMEA962346 NaN ERX011189 0 SINGLE HCT20172 7274 4097 ERR030859 ... sra ncbi Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC 2014-05-30 Homo sapiens ILLUMINA 2011-03-17 ERP000546
4 100 SAMEA962346 NaN ERX011223 0 SINGLE HCT20172 7241 4062 ERR030860 ... sra ncbi Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC 2014-05-30 Homo sapiens ILLUMINA 2011-03-17 ERP000546

5 rows × 30 columns

In [183]:
sra_table.columns
# sra_table['organism_part'].unique()
sra_table[sra_table['organism_part'] == 'brain']
Out[183]:
AvgSpotLen BioSample BioSourceProvider Experiment InsertSize LibraryLayout Library_Name MBases MBytes Run ... DATASTORE_filetype DATASTORE_provider Instrument LibrarySelection LibrarySource LoadDate Organism Platform ReleaseDate SRA_Study
26 100 SAMEA962344 Human brain total RNA, Ambion ERX011200 264 PAIRED HCT20160 7010 4016 ERR030882 ... sra ncbi Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC 2014-05-30 Homo sapiens ILLUMINA 2011-03-17 ERP000546
34 75 SAMEA962344 Human brain total RNA, Ambion ERX011186 0 SINGLE HCT20160 4600 3490 ERR030890 ... sra ncbi Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC 2014-05-30 Homo sapiens ILLUMINA 2011-03-17 ERP000546

2 rows × 30 columns

As the website listed above for this exercise notes, FastQC was mostly built around genome assembly applications, and so the quality scores and assumptions about suitability for each sequence or base may not always be appropriate for other purposes, such as RNA-seq. So the icons summarizing each tab aren't to be considered definitive.

In [5]:
from matplotlib import pyplot as plt
from matplotlib import image
import os

plt.close('all')

fastq3_dir = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\\
Galaxy3-[adrenal_2.fastq].fastqsanger_fastqc\Galaxy3-[adrenal_2.fastq].fastqsanger_fastqc\Images'

# fig = plt.figure(figsize=(20, 40))
my_dpi = 96
# plt.figure(figsize=(60, 80))#600/my_dpi, 800/my_dpi), dpi=my_dpi)
fig_scale = 500
fig, axs = plt.subplots(2, 1, figsize=(6*fig_scale/my_dpi, 8*fig_scale/my_dpi), dpi=my_dpi)
# ax1 = fig.add_subplot(211)
# ax2 = fig.add_subplot(212)

axs[0].imshow(image.imread(os.path.join(fastq3_dir, 'per_base_quality.png')))
axs[1].imshow(image.imread(os.path.join(fastq3_dir, 'per_sequence_quality.png')))

fig.suptitle('Galaxy3-[adrenal_2]', fontsize=0.08 * fig_scale)
# plt.tight_layout()
# axs[0].set_title('Galaxy3-[adrenal_2]')
plt.show()

I would like a general-purpose function to print specific graphs from each FastQC output. Once FastQC has run, it outputs a zip file. When extracted, this has the general structure

[dir the fastq file was in] / [dir with same name as the fastq file]\_fastqc / 'Images' / 'per_base_quality.png'

or

[dir the fastq file was in] / [dir with same name as the fastq file]\_fastqc / 'Images' / 'per_sequence_quality.png'
In [158]:
import os
import re
import matplotlib.pyplot as plt
from collections import defaultdict

def plot_fastqc(outer_dir):
    fastqc_plots = defaultdict()
    pattern1 = re.compile(r'(.*)\.fastq.*')
    
    # get all files and subdirectories in the dir containing the fastq files
    for root, dirs, files in os.walk(outer_dir):
        for file in files:
            # look for the 'per_base_quality' files
            if file == 'per_base_quality.png':
                # split the path to that file into drives/dirs
                subdirs = root.split(os.sep)
                # get just the part of the penultimate dir that's unique
                # put it in the dict, make a subdict called per_base and set the
                # image as its item
                fastq_name = re.search(pattern1, subdirs[-2]).group(1)
                fastqc_plots[fastq_name] = {}
                fastqc_plots[fastq_name].update({'per_base' : os.path.join(root, file)})
            elif file == 'per_sequence_quality.png':
                subdirs = root.split(os.sep)
                fastq_name = re.search(pattern1, subdirs[-2]).group(1)
                fastqc_plots[fastq_name].update({'per_sequence' : os.path.join(root, file)})
            elif file == 'per_base_sequence_content.png':
                subdirs = root.split(os.sep)
                fastq_name = re.search(pattern1, subdirs[-2]).group(1)
                fastqc_plots[fastq_name].update({'per_base_sequence_content' : os.path.join(root, file)})
#             elif file == 'kmer_profiles.png':
#                 subdirs = root.split(os.sep)
#                 fastq_name = re.search(pattern1, subdirs[-2]).group(1)
#                 blah = fastqc_plots.get(fastq_name, None)
#                 print('That worked.', fastqc_plots[re.search(pattern1, subdirs[-2]).group(1)])#, fastqc_plots[fastq_name])
#                 fastqc_plots[fastq_name].update({'kmer_profiles' : os.path.join(root, file)})
#                 blah.update({'kmer_profiles' : os.path.join(root, file)})
#                 fastqc_plots[fastq_name].update({'kmer_profiles' : os.path.join(root, file)})
            else:
                continue
    
    # Now, plot the images
    plt.close('all')
    my_dpi = 96
    fig_scale = 500
    fig, axs = plt.subplots(3, len(fastqc_plots), 
                            figsize=(6*fig_scale/my_dpi, 4*fig_scale/my_dpi), 
                            dpi=my_dpi)
    tick_params1 = { 'axis': 'x', 'which': 'both', 'bottom': 'off', 'top': 'off', 'labelbottom': 'off'}

    for i, fastq_name in enumerate(fastqc_plots):
        axs[0, i].imshow(plt.imread(fastqc_plots[fastq_name]['per_base']))
        axs[0, i].axis('off')
        axs[0, i].set_title(fastq_name, fontsize=fig_scale/17)
        axs[1, i].imshow(plt.imread(fastqc_plots[fastq_name]['per_sequence']))
        axs[1, i].axis('off')
        axs[2, i].imshow(plt.imread(fastqc_plots[fastq_name]['per_base_sequence_content']))
        axs[2, i].axis('off')
#         axs[3, i].imshow(plt.imread(fastqc_plots[fastq_name]['kmer_profiles']))

    plt.subplots_adjust(wspace=0.05, hspace=0)
    plt.show()
    return fastqc_plots
In [164]:
outer_dir = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
fastqc_plots = plot_fastqc(outer_dir)

Note: It's really strange; I had originally designed the plot_fastqc function to yield 4 graphs, but whether I try 'kmer_profiles' or 'duplication_levels' as the final field, it always returns an error; either about the key (which would be the same dict key as in the previous 3 fields) or saying 'Nonetype has no attribute 'update''. I copied and pasted the code that generates that key for each of the 'elif' statements, so there's no reason it could have gone wrong. I think there's something wrong with Python or IPython. In any case, for now, rather than re-working it, I'm just going to omit the fourth graph and move on.</font color='red'>

Plotting them this way makes the graphs a little small, but it is at least possible to see the overall trends. For instance, the 'Galaxy4' run was particularly poor, especially for the first 10 bases.

Another option is to take the FastQC output 'data.txt' file for each FastQ file, parse the info inside, and re-plot in Matplotlib. Actually, that's kind of ugly, since the FastQ file just returns summary stats (mean, median, quartiles), rather than raw data, and matplotlib box plots are calculated from raw data. You could fudge the summary stats into a form plottable by matplotlib, but it seems like extra work.

As an alternative, at this point, I feel like switching over to BioPython, rather than carrying through Trimmomatic. We'll see how that goes.

In [8]:
from Bio import SeqIO
import pandas as pd
import os
import numpy as np

start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'

def parse_fastq(start_dir):
    biopy_fastqs = {}
    for file in os.listdir(start_dir):
        if file.split(sep='.')[-1] == 'fastqsanger':
            name = file.split(sep='.')[0]+']'
            file_path = os.path.join(start_dir, file)
            biopy_fastqs[name] = {}
            biopy_fastqs[name].update({'path': os.path.join(start_dir, file),
                                       'seqs': []})
            for record in SeqIO.parse(file_path, "fastq"):
                biopy_fastqs[name]['seqs'].append({'id': record.id, 
                                                   'sequence': np.array(list(str(record.seq))),
                                                   'quality': np.array(record.letter_annotations['phred_quality'])})
            biopy_fastqs[name]['df'] = pd.DataFrame.from_dict(biopy_fastqs[name]['seqs'])
    return biopy_fastqs
In [9]:
# biopy_fastqs = parse_fastq(start_dir_Mac)
biopy_fastqs = parse_fastq(start_dir_PC)

biopy_fastqs['Galaxy2-[adrenal_1]']['df'].head()
Out[9]:
id quality sequence
0 ERR030881.107 [20, 13, 20, 19, 19, 11, 19, 19, 19, 18, 19, 1... [A, T, C, T, T, T, T, G, T, G, G, C, T, A, C, ...
1 ERR030881.311 [38, 37, 37, 37, 38, 37, 37, 33, 37, 34, 39, 3... [T, C, C, A, T, A, C, A, T, A, G, G, C, C, T, ...
2 ERR030881.1487 [20, 20, 9, 6, 10, 5, 5, 20, 6, 20, 20, 7, 6, ... [G, T, A, T, A, A, C, G, C, T, A, G, A, C, A, ...
3 ERR030881.9549 [35, 27, 31, 35, 35, 32, 31, 32, 25, 32, 39, 3... [A, A, C, G, G, A, T, C, C, A, T, T, G, T, T, ...
4 ERR030881.13497 [37, 31, 37, 37, 37, 38, 38, 37, 38, 37, 39, 3... [G, C, T, A, A, T, C, C, G, A, C, T, T, C, T, ...
In [10]:
biopy_fastqs['Galaxy2-[adrenal_1]']['seqs'][:2]
Out[10]:
[{'id': 'ERR030881.107',
  'quality': array([20, 13, 20, 19, 19, 11, 19, 19, 19, 18, 19, 19, 20, 20, 20, 34, 34,
         30, 34, 32, 36, 37, 31, 36, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37,
         37, 37, 37, 37, 37, 37, 37, 37, 36, 37, 37, 37, 36, 37, 37, 37]),
  'sequence': array(['A', 'T', 'C', 'T', 'T', 'T', 'T', 'G', 'T', 'G', 'G', 'C', 'T',
         'A', 'C', 'A', 'G', 'T', 'A', 'A', 'G', 'T', 'T', 'C', 'A', 'A',
         'T', 'C', 'T', 'G', 'A', 'A', 'G', 'T', 'C', 'A', 'A', 'A', 'A',
         'C', 'C', 'A', 'A', 'C', 'C', 'A', 'A', 'T', 'T', 'T'], dtype='<U1')},
 {'id': 'ERR030881.311',
  'quality': array([38, 37, 37, 37, 38, 37, 37, 33, 37, 34, 39, 39, 39, 39, 39, 39, 39,
         39, 39, 39, 40, 39, 36, 36, 36, 31, 31, 31, 28, 38, 39, 38, 39, 39,
         39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39]),
  'sequence': array(['T', 'C', 'C', 'A', 'T', 'A', 'C', 'A', 'T', 'A', 'G', 'G', 'C',
         'C', 'T', 'C', 'G', 'G', 'G', 'G', 'T', 'G', 'G', 'G', 'G', 'G',
         'A', 'G', 'T', 'C', 'A', 'G', 'A', 'A', 'G', 'C', 'C', 'C', 'C',
         'C', 'A', 'G', 'A', 'C', 'C', 'C', 'T', 'G', 'T', 'G'], dtype='<U1')}]

Interestingly, on the Mac, that generates an error (warning?) about the IOPub data rate being exceeded. The notebook server docs output suggests setting a flag upon launching the server, which sounds like a pain. I found this suggestion about changing a config file. I'll try that now.

$ jupyter notebook --generate-config
Writing default config to: /Users/drew/.jupyter/jupyter_notebook_config.py
$ sudo nano /Users/drew/.jupyter/jupyter_notebook_config.py

Searched and found (line 155):

\# c.NotebookApp.iopub_data_rate_limit = 1000000

So I just added a line below it without the comment char, and with one more zero:

c.NotebookApp.iopub_data_rate_limit = 10000000

That appears to work; I don't get the warning now.

In [11]:
df1 = biopy_fastqs['Galaxy2-[adrenal_1]']['df']

df1.shape
Out[11]:
(50121, 3)
In [12]:
df1.iloc[:2, 1].as_matrix()
Out[12]:
array([array([20, 13, 20, 19, 19, 11, 19, 19, 19, 18, 19, 19, 20, 20, 20, 34, 34,
       30, 34, 32, 36, 37, 31, 36, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 36, 37, 37, 37, 36, 37, 37, 37]),
       array([38, 37, 37, 37, 38, 37, 37, 33, 37, 34, 39, 39, 39, 39, 39, 39, 39,
       39, 39, 39, 40, 39, 36, 36, 36, 31, 31, 31, 28, 38, 39, 38, 39, 39,
       39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39])],
      dtype=object)

What I'm looking for here is a quick way to iterate over the rows of this series, and create a new array with the same index of each row as a new row, etc. Or else, split them out into individual arrays, then perform a transpose operation and recombine them. A good start is just to check out the numpy array methods available.

Reshape might be suitable, maybe while specifying 'order="F"'? The problem there is that I don't know how to specify the 'newshape' arg. The array should have a shape of 50121 rows, 1 column, and each element in each cell is another array with shape (1, 50). I guess it's actually listed '(50,)'.

hstack/vstack?

Ok, I think I've got it now; vstack then transpose seems to work. The problem appears to be that having a numpy array where each element within is itself an array is not the same as having a multidimensional numpy array. That is, nested arrays are not quite the same thing as a contiguous array. I guess.

See below for a demonstration of how Python interprets some of these structures. I don't fully understand it myself, but I knew that there must be a builtin method to flatten and then rearrange the data without using a 'for' loop.

In [13]:
q1 = df1.iloc[:, 1]

print('1\t', type(q1), q1.shape)
print('2\t', type(q1), q1.shape)
print('3\t', type(np.vstack(q1)), np.vstack(q1).shape, '\n')

print('4\t', type(q1.as_matrix()), q1.as_matrix().shape)
print('5\t', type(np.vstack(q1.as_matrix())), '\n')

print('6\t', q1.as_matrix()[:2], '\n')
print('7\t', np.vstack(q1.as_matrix()[:2]), '\n')
print('8\t', np.vstack(q1.as_matrix()[:2]).T[:5], '\n')

print('9\t', type(q1.as_matrix()[0]), q1.as_matrix()[0].shape)
print('10\t', type(q1[0]), q1[0].shape)

# The line below yields AttributeError: 
# 'numpy.ndarray' object has no attribute 'as_matrix'
# print('10\t', type(np.vstack(q1.as_matrix()[0])), np.vstack(q1[0].as_matrix()[0]).shape)
1	 <class 'pandas.core.series.Series'> (50121,)
2	 <class 'pandas.core.series.Series'> (50121,)
3	 <class 'numpy.ndarray'> (50121, 50) 

4	 <class 'numpy.ndarray'> (50121,)
5	 <class 'numpy.ndarray'> 

6	 [array([20, 13, 20, 19, 19, 11, 19, 19, 19, 18, 19, 19, 20, 20, 20, 34, 34,
       30, 34, 32, 36, 37, 31, 36, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 36, 37, 37, 37, 36, 37, 37, 37])
 array([38, 37, 37, 37, 38, 37, 37, 33, 37, 34, 39, 39, 39, 39, 39, 39, 39,
       39, 39, 39, 40, 39, 36, 36, 36, 31, 31, 31, 28, 38, 39, 38, 39, 39,
       39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39])] 

7	 [[20 13 20 19 19 11 19 19 19 18 19 19 20 20 20 34 34 30 34 32 36 37 31 36
  36 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 36 37 37 37 36 37
  37 37]
 [38 37 37 37 38 37 37 33 37 34 39 39 39 39 39 39 39 39 39 39 40 39 36 36
  36 31 31 31 28 38 39 38 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39
  39 39]] 

8	 [[20 38]
 [13 37]
 [20 37]
 [19 37]
 [19 38]] 

9	 <class 'numpy.ndarray'> (50,)
10	 <class 'numpy.ndarray'> (50,)
In [14]:
type(q1)
Out[14]:
pandas.core.series.Series

So if I can get them in that format, the question now is the best way to build that into the existing function or add a new function. I suppose that ultimately the best way to handle this is to make it a class, but I'm still not that up on building those.

The 'per_base' scores are attributes of the entire FastQ file, not any particular sequence within it, so I'll put them in the higher-level dict first, then copy to the df. The ones I might care about are 'per_base_quality', 'per_base_sequence_content', and 'per_base_N_content'.

...Actually, that might not work well. The higher-order dict entries are all distinct entities for which there's no obvious 'round-these-all-up-in-numpy' function. I might have to filter it through pandas anyway, to be able to get them out as a 'to_matrix()' kind of functionality. Note: there probably should be a way to pass the sequence elements from the dict iteratively into a new array (perhaps using the fromiter method) before breaking them out into a single multidimensional array and transposing, but rather than learn about that I'll just try building the dataframe first, then using that to get the combined arrays. It's still the case that the output should be stored in the dict associated with each FastQ file, however; attempting to append these values to the pandas DF fails because the length of the DF's index doesn't match the length of the transposed array of 'per_base' scores.

  • For per_base_N_content, I want to take the vstack().T array and, per row in the array, output a count of just 'N' values. It sounds like the numpy.where() method might be best for that. Alternatively, maybe np.unique()?

    • As this post points out, the behavior of np.where() is, if the input array is 2D, as here, to output two arrays: the first contains a list of all rows where the condition is met, the second contains a list of all columns where the condition is met.

    • The two output arrays of np.where() should therefore be the exact same shape, a tuple of form '(x, )', where x is equal to the exact count of how many times the condition was met.

    • From that understanding, there's really no need to convert the sequence array into a per-base-seq array using the whole vstack().T approach. It should suffice to take the sequence of an entire FastQ file as an array, where each row is a read and each column is a position from 0 to 50 within each read, feed it through np.where(), take the second array of its output, which gives column values, then count how often each int appears, and plot that.

    • Actually, the approach listed in the bullet point immediately above sounds unnecessarily complex. It should be easier to combine 'np.vstack()' with 'np.unique( , return_counts=True, axis=1)'... but for some reason that keeps the elements all separate; it outputs in an unexpected format. I see now that I was using 'np.nditer()', and by taking a closer look at these docs, I see that that method is meant not to iterate over just rows within an array, but over individual values. It looks like numpy is still meant to use for loops to iterate over entire rows within an array.

  • For per_base_sequence_content, I'll probably want a variant of the same output: the where() method might work.

  • For per_base_quality, I can just leave the scores as is, then call a box-and-whiskers plot on them later.

The only 'per_sequence' score that I might want is probably 'per_sequence_quality' (average quality per read).

Actually, for each FastQ file, I'd also like to know how many duplicates are present; that seems like potentially useful info. Since these files are RNAseq data, and may represent truly useful info about sequence abundance in the original sample, I probably won't end up deleting duplicates, but it still seems like important info to track. The interpretation of the equivalent example from FastQC is a tad nuanced, but I think I can get through it.

The raw value_counts are not super graph-friendly, because of how large the number of one-offs are present; I could try log-log scales (or semi-log), but FastQC's choice of bins of 1-10, 50, 500, and 5,000 counts might be a good format to copy. As such, the pandas.cut() method would be a convenient way to apply them.

In [165]:
from Bio import SeqIO
import pandas as pd
import os
import numpy as np
from collections import defaultdict

start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'

def parse_fastq(start_dir):
    biopy_fastqs = {}
    phred_scale = 39   # Assuming Illumina data; change if diff qual score scale
    for file in os.listdir(start_dir):
        if file.split(sep='.')[-1] in ['fastqsanger', 'fastq']:
            name = file.split(sep='.')[0]+']'
            file_path = os.path.join(start_dir, file)
            biopy_fastqs[name] = {}
            biopy_fastqs[name].update({'path': os.path.join(start_dir, file),
                                       'seqs': []})
            for record in SeqIO.parse(file_path, "fastq"):
                biopy_fastqs[name]['seqs'].append({'id': record.id, 
                                                   'sequence': np.array(list(str(record.seq))),
                                                   'seq_whole': str(record.seq),
                                                   'quality': np.array(record.letter_annotations['phred_quality'])})
                
            biopy_fastqs[name]['df'] = pd.DataFrame.from_dict(biopy_fastqs[name]['seqs'])
            biopy_fastqs[name]['per_base_qual'] = np.vstack(biopy_fastqs[name]['df']['quality']).T
            biopy_fastqs[name]['per_base_seq'] = np.vstack(biopy_fastqs[name]['df']['sequence']).T
            
            # Count base calls to dicts; one for raw counts, one for percentages:
            biopy_fastqs[name]['per_base_counts'] = []
            biopy_fastqs[name]['per_base_percents'] = []
            
            for i, row in enumerate(biopy_fastqs[name]['per_base_seq']):
                biopy_fastqs[name]['per_base_counts'].append(defaultdict())
                unique, counts = np.unique(row, return_counts=True)
                biopy_fastqs[name]['per_base_counts'][i] = dict(zip(unique, counts))
                
            for i, dictionary in enumerate(biopy_fastqs[name]['per_base_counts']):
                biopy_fastqs[name]['per_base_percents'].append(defaultdict())
                for k in dictionary.keys():
                    biopy_fastqs[name]['per_base_percents'][i][k] = dictionary[k] / sum(dictionary.values())
                        
            # Let's put the counts into a 'per_base_df':
            biopy_fastqs[name]['per_base_seq_df'] = pd.DataFrame.from_dict(
                biopy_fastqs[name]['per_base_counts']).fillna(value=0)
            biopy_fastqs[name]['per_base_percents_df'] = pd.DataFrame.from_dict(
                biopy_fastqs[name]['per_base_percents']).fillna(value=0)
            biopy_fastqs[name]['per_base_qual_df'] = pd.DataFrame(
                biopy_fastqs[name]['per_base_qual']).fillna(value=0)
            
            # add values for duplication level
            biopy_fastqs[name]['duplicates'] = biopy_fastqs[name]['df'].groupby(['seq_whole']).size()\
            .value_counts().sort_index()

            # add mean per_seq_qual
            biopy_fastqs[name]['mean_seq_qual'] = pd.cut(biopy_fastqs[name]['df']['quality'] \
                                                       .apply(np.mean), range(1, 41)).value_counts().sort_index()
            biopy_fastqs[name]['mean_seq_qual'].index = np.arange(2, phred_scale + 2)

            
    
    return biopy_fastqs
In [166]:
biopy_fastqs = parse_fastq(start_dir_PC)
# biopy_fastqs = parse_fastq(start_dir_Mac)
In [167]:
for x in biopy_fastqs:
    print(x)

print('\n')

for x in biopy_fastqs['Galaxy2-[adrenal_1]']: 
    print(x, type(biopy_fastqs['Galaxy2-[adrenal_1]'][x]))
Galaxy2-[adrenal_1]
Galaxy3-[adrenal_2]
Galaxy4-[brain_1]
Galaxy5-[brain_2]


path <class 'str'>
seqs <class 'list'>
df <class 'pandas.core.frame.DataFrame'>
per_base_qual <class 'numpy.ndarray'>
per_base_seq <class 'numpy.ndarray'>
per_base_counts <class 'list'>
per_base_percents <class 'list'>
per_base_seq_df <class 'pandas.core.frame.DataFrame'>
per_base_percents_df <class 'pandas.core.frame.DataFrame'>
per_base_qual_df <class 'pandas.core.frame.DataFrame'>
duplicates <class 'pandas.core.series.Series'>
mean_seq_qual <class 'pandas.core.series.Series'>
In [168]:
per_seq_qual = biopy_fastqs['Galaxy2-[adrenal_1]']['df']['quality'].mean()
plt.hist(per_seq_qual)
plt.xlim([0, 40])
plt.show()
In [19]:
biopy_fastqs['Galaxy2-[adrenal_1]']['per_base_seq_df'].head()
Out[19]:
A C G N T
0 5220 22899 16298 28.0 5676
1 9635 9188 16905 19.0 14374
2 9537 13661 17916 0.0 9007
3 15265 10913 16628 14.0 7301
4 13011 10464 18146 1.0 8499
In [20]:
biopy_fastqs['Galaxy2-[adrenal_1]']['per_base_percents_df'].head()
Out[20]:
A C G N T
0 0.104148 0.456874 0.325173 0.000559 0.113246
1 0.192235 0.183316 0.337284 0.000379 0.286786
2 0.190280 0.272560 0.357455 0.000000 0.179705
3 0.304563 0.217733 0.331757 0.000279 0.145667
4 0.259592 0.208775 0.362044 0.000020 0.169570
In [21]:
import matplotlib.pyplot as plt

biopy_fastqs['Galaxy2-[adrenal_1]']['per_base_percents_df'].plot()
plt.ylim([0, 1])
plt.show()

Now, put them together into a function. I think I'm going to need this info.

In [169]:
import matplotlib.pyplot as plt

def plot_fastq(biopy_fastqs): 
    batches = int(np.ceil(len(biopy_fastqs) / 4))
    plt.close('all')
    plt.style.use('ggplot')
    my_dpi = 96
    fig_scale = 500
    # put at most 4 fastq files per row
    fig, axs = plt.subplots(nrows=5*batches, ncols=4, 
                            figsize=(4*fig_scale/my_dpi, 4*batches*fig_scale/my_dpi), 
                            dpi=my_dpi)

    for i, fastq_name in enumerate(sorted(biopy_fastqs)):
        batch = int(np.floor(i / 4))
        i2 = i % 4
        axs[batch, i%4].set_title(fastq_name, fontsize=fig_scale/25)
        axs[batch, 0].set_ylabel('Per base \nQuality', fontsize=fig_scale/25)
        axs[batch+1, 0].set_ylabel('Per base \nSequence \nContent', fontsize=fig_scale/25)
        axs[batch+2, 0].set_ylabel('Per base \nN Content', fontsize=fig_scale/25)
        axs[batch+3, 0].set_ylabel('Per Sequence \nMean Quality\nScore', fontsize=fig_scale/25)
        axs[batch+4, 0].set_ylabel('# of Seq. vs. \nDuplication \nLevel', fontsize=fig_scale/25)
        biopy_fastqs[fastq_name]['per_base_qual_df'].T.plot(kind='box', 
                                                            showfliers=False, 
                                                            ax=axs[0, i], 
                                                            sharey=True,
                                                            ylim=[0, 40])
        biopy_fastqs[fastq_name]['per_base_percents_df'].plot(kind='line', 
                                                              ax=axs[1, i],
                                                              sharey=True,
                                                              ylim=[0, 1])
        biopy_fastqs[fastq_name]['per_base_percents_df']['N'].plot(kind='line', 
                                                                   ax=axs[2, i],
                                                                   sharey=True,
                                                                   ylim=[0, 0.1])
        biopy_fastqs[fastq_name]['mean_seq_qual'].plot(kind='line',
                                                       ax=axs[3, i],
                                                       sharey=True)
        axs[3, i].set_ylim(top=12000)
#         axs[3, i].set_xticklabels(list(range(1, 41)))
#         axs[3, i].set_yscale('log')
        biopy_fastqs[fastq_name]['duplicates'].plot(kind='line',
                                              ax=axs[4, i],
                                              sharey=True)
        axs[4, i].set_xscale('log')
        axs[4, i].set_yscale('log')
        axs[4, i].set_ylim(bottom=0.5)
        


    plt.subplots_adjust(wspace=0.05, hspace=0.15)
    plt.show()
In [170]:
plot_fastq(biopy_fastqs)

I think that those results look pretty similar to the FastQC results, but just to be sure that I preserved the right relationships when applying the operations to calculate the per_base scores, I'd like to compare the summary statistics of the per_base_qual from BioPython to those output from the FastQC_data files:

In [24]:
# This cell to set-up comparison of per_base_qual stats
# derived from FastQC software to those from my biopython wrangling

from Bio import SeqIO
import pandas as pd
import os
import numpy as np
from collections import defaultdict

start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'

def parse_fastqc_data(start_dir):
    fastqc_data = defaultdict()
    for root, subdirs, files in os.walk(start_dir):
        for file in files:
            if file == 'fastqc_data.txt':
                    name = os.path.split(root)[-1].split(sep='.')[0]+']'
                    fastqc_data[name] = {}
                    with open(os.path.join(root, file)) as f:
                        # Summary data for per_base_quality seems to be on lines 14-63
                        lines = [line.rstrip('\n') for line in f]
                        fastqc_data[name]['lines'] = []
                        for line in lines[13:63]:
                            fastqc_data[name]['lines'].append(line.split(sep='\t'))
                            
                    fastqc_data[name]['compare'] = pd.DataFrame()
                    fastqc_data[name]['compare']['biopy'] = \
                        biopy_fastqs[name]['per_base_qual_df'].T.mean(axis=0)
                    fastqc_data[name]['compare']['fastqc'] = \
                        [float(stats[1]) for stats in fastqc_data[name]['lines'][:]]
                    fastqc_data[name]['compare']['ratio'] = (
                        fastqc_data[name]['compare']['biopy'] / fastqc_data[name]['compare']['fastqc'])
            else:
                continue

    return fastqc_data
In [25]:
fastqc_data = parse_fastqc_data(start_dir_PC)

for name in fastqc_data:
    print(name, fastqc_data[name]['compare']['ratio'].describe())
Galaxy2-[adrenal_1] count    50.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
Name: ratio, dtype: float64
Galaxy3-[adrenal_2] count    50.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
Name: ratio, dtype: float64
Galaxy4-[brain_1] count    50.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
Name: ratio, dtype: float64
Galaxy5-[brain_2] count    50.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
Name: ratio, dtype: float64

Ok, that was a bit of overkill, but it shows definitively that the biopython-based wrangling of the data into per_base statistics gives the same result in all cases as does FastQC.

So at this point I've ensured that I can rearrange data from BioPython's SeqIO module that can gather the same summary stats as the FastQC tool, and I've gotten a decent overall view of the associated plots. The problem is that the individual plots are a little too small within the IPython/Jupyter Notebook window to get all of the essential data out, so I'd like to generate an interactive plot that can enlarge individual subplots upon mousing over or clicking. (As a more minor point, if > 4 fastq files are present in the host dir, you could end up with too many columns to read, no matter how big you make the window, so there should be a line to wrap the axes into columns of 4 for future use.)

This could end up being a bit too long an undertaking at this point; I may earmark it for future improvement. Best bets for interactive visualization at this point appear to be Bokeh, which sounds like it's basically a Python wrapper for the D3.js (javascript-based) library, along with the associated tool Holoviews, which sounds like it's mostly meant to simplify some of the gruntwork of setting up a visualization using tools like Bokeh.

Once the visualization is suitably handled, the next task is trimming. As you can see from this example, the Galaxy2 reads are all pretty high quality, but some of the other runs are pretty bad. In the past, after checking reads with FastQC, I've run them through the Trimmomatic tool to cut off parts of reads that don't reach a quality threshold. Another option, to which I was directed from a forum discussing BioPython, is sickle; it's not native Python, however, so it probably wouldn't be any easier to use from within Python than Trimmomatic.

Actually, first it might be best to throw out entire reads if they're poor quality; such as those with a mean qual below, say, 30? Then, trim individual bases by quality score from the reads that pass that filter. When trimming, you obviously don't want to just delete the low-quality bases, since that may end up concatenating better bases on either side. Rather, delete the offending base and all those 5' or 3' to it, depending on whether you're trimming from the 5' or 3' end, both of which tend to have a higher proportion of low quality scores, while the middle of the read is generally better. I could define these as two different functions, one that reads progressively more 3' up to base #25, and one that starts at the 3' end and reads upstream until the midpoint of the read, as well.

Finally, since tiny reads are likely to map to multiple reference sequences ambiguously, I'd want to throw out any reads that were too short after trimming for quality. Say, <35bp in length.

When it comes to deciding upon quality cutoffs, length cutoffs, and other such important scores to decide in the trimming process, I find Biostars to be the website with the most useful discussions.

Actually, there's one more factor that I've been overlooking: these reads are from a paired-end dataset. I don't have any easy way to handle that in BioPython right now. Maybe I should stop re-inventing the wheel with this analysis and just bring in some other bioinformatics packages.

Trimmomatic expects paired reads to be fed in as separate files. Though I've been ignoring the biological details of the derivation of these data sets, it would appear that we in fact have two different sets of paired end reads from two separate libraries with these four files: the first few reads in Galaxy2's identifiers match that of Galaxy3; Galaxy2's have the '/1' suffix, and Galaxy3's have the '/2', so the order is what you'd expect.

Therefore, I can process these with Trimmomatic directly. I could write a Python-based pipeline to execute this, but that might be awkward given the need to switch to the Unix shell. Eh, I'll try it.


Note: on the PC, I tried 'sudo apt-get install trimmomatic' in the Ubuntu shell window, and got:

Selecting previously unselected package trimmomatic.
(Reading database ... 27039 files and directories currently installed.)
Preparing to unpack .../trimmomatic_0.35+dfsg-1_all.deb ...
Unpacking trimmomatic (0.35+dfsg-1) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up trimmomatic (0.35+dfsg-1) ...

as output. Checking the expected (default) install dir, I find the executable at:

/usr/bin/TrimmomaticPE</font>


Example input for PE:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
In [26]:
import sys, os

def trimmomatic_paired(biopy_fastqs, mac_pc='mac', **kwargs):
    
    # Set the path for trimmomatic based on which laptop I'm using
    rand_entry = list(biopy_fastqs.keys())[0]
    if mac_pc == 'mac':
        trim_path = '/Users/drew/Documents/Data/Bio/\
Trimmomatic-0.36/trimmomatic-0.36.jar'
        commands = []
        wd = os.path.dirname(biopy_fastqs[rand_entry]['path'])
        sep = ';'
    elif mac_pc == 'pc':
        trim_path = r'C:\Users\DMacKellar\Documents\Python\\
BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar'
        commands = []
        wd = os.path.dirname(biopy_fastqs[rand_entry]['path'])
        sep = ' & '
    else:
        print('Please specify "mac_pc" arg as "mac" or "pc"')
    
    adapter_path = os.path.join(os.path.dirname(trim_path), 'adapters')
    # Confirm that the file exists
    try:
        os.path.isfile(trim_path)
    except:
        print('trimmomatic.jar file not found at expected location')
        sys.exit(1)   # abort the function
    
    # Confirm that the input dict 
    # contains an even number of fastq files
    try:
        len(biopy_fastqs) % 2 != 0
    except:
        print('Input has odd number of fastq files; \
\nPlease enter paired end data.')
        sys.exit(1)   # abort the function
        
    use_kwargs = {}
    
    # Specify valid kwarg keys and default values
    default_kwargs = {
        'phred': '-phred33',
        'ILLUMINACLIP': '',
        'LEADING': 'LEADING:25',
        'TRAILING': 'TRAILING:25',
        'SLIDINGWINDOW': '',
        'MINLEN': 'MINLEN:35'
    }
    
    # For any input kwargs, pass them to the use_kwargs dict
    for key, value in kwargs.items():
        if key in default_kwargs:
            use_kwargs[key] = value
            if (key=='ILLUMINACLIP') & (value != ''):
                illclip_split = value.split(sep=':')
                adapter_file = os.path.join(adapter_path,
                                            illclip_split[1])
                try:
                    os.path.isfile(adapter_file)
                    illclip_join = ':'.join((illclip_split[0],
                                             adapter_file,
                                             *(x for x in illclip_split[2:])))
                    use_kwargs['ILLUMINACLIP'] = illclip_join
                except OSError as err:
                    print('Adapter file {0} not found.'.format(adapter_file))
                    sys.exit(1)   # abort the function

        else:
            print('kwarg key, value pair: {0}: {1} not understood;\
please specify key in (case-sensitive) manner.\
Default kwargs are: {3}'.format(key,
                                value,
                                default_kwargs))
    
    # Initialize some default settings in case none specified
    for key, value in default_kwargs.items():
        if key not in use_kwargs:
            use_kwargs[key] = value
            
    print('kwarg key, value pairs used: ', use_kwargs)
        
    # Format and output the relevant commands
    fastq_files = [x for x in sorted(biopy_fastqs)]
    fastq_paths = [biopy_fastqs[x]['path'] 
                   for x in sorted(biopy_fastqs)]
    forwards = fastq_files[0::2]
    forward_paths = fastq_paths[0::2]
    reverses = fastq_files[1::2]
    reverse_paths = fastq_paths[1::2]
    
    for f, f_p, r, r_p in zip(forwards, forward_paths, reverses, reverse_paths):
        trim_commands = ['java -jar', trim_path, 'PE', use_kwargs['phred'],
                        use_kwargs['ILLUMINACLIP'], use_kwargs['LEADING'], 
                        use_kwargs['TRAILING'], use_kwargs['SLIDINGWINDOW'], 
                        use_kwargs['MINLEN']]
        os.chdir(wd)
        to_insert = [f_p, r_p, '%s_trimmed_paired.fastq' % f,
                     '%s_trimmed_unpaired.fastq' % f, 
                     '%s_trimmed_paired.fastq' % r, 
                     '%s_trimmed_unpaired.fastq' % r]
        for i, thing in enumerate(to_insert):
            trim_commands.insert(3+i, thing)

        trim_commands = ' '.join(trim_commands)
        commands.append(trim_commands)
        
    commands = sep.join(commands)
        
    return commands
In [27]:
str.join?
In [28]:
commands = trimmomatic_paired(biopy_fastqs, mac_pc='mac')
commands
kwarg key, value pairs used:  {'phred': '-phred33', 'ILLUMINACLIP': '', 'LEADING': 'LEADING:25', 'TRAILING': 'TRAILING:25', 'SLIDINGWINDOW': '', 'MINLEN': 'MINLEN:35'}
Out[28]:
'java -jar /Users/drew/Documents/Data/Bio/Trimmomatic-0.36/trimmomatic-0.36.jar PE C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\Galaxy2-[adrenal_1.fastq].fastqsanger C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\Galaxy3-[adrenal_2.fastq].fastqsanger Galaxy2-[adrenal_1]_trimmed_paired.fastq Galaxy2-[adrenal_1]_trimmed_unpaired.fastq Galaxy3-[adrenal_2]_trimmed_paired.fastq Galaxy3-[adrenal_2]_trimmed_unpaired.fastq -phred33  LEADING:25 TRAILING:25  MINLEN:35;java -jar /Users/drew/Documents/Data/Bio/Trimmomatic-0.36/trimmomatic-0.36.jar PE C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\Galaxy4-[brain_1.fastq].fastqsanger C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\Galaxy5-[brain_2.fastq].fastqsanger Galaxy4-[brain_1]_trimmed_paired.fastq Galaxy4-[brain_1]_trimmed_unpaired.fastq Galaxy5-[brain_2]_trimmed_paired.fastq Galaxy5-[brain_2]_trimmed_unpaired.fastq -phred33  LEADING:25 TRAILING:25  MINLEN:35'

I copied the command above to the terminal on my Mac. It ran quickly, and output:

/Users/drew/Documents/Data/Python/Galaxy_rnaseq/Galaxy2-[adrenal_1.fastq].fastqsanger /Users/drew/Documents/Data/Python/Galaxy_rnaseq/Galaxy3-[adrenal_2.fastq].fastqsanger Galaxy2-[adrenal_1]_trimmed_paired.fastq Galaxy2-[adrenal_1]_trimmed_unpaired.fastq Galaxy3-[adrenal_2]_trimmed_paired.fastq Galaxy3-[adrenal_2]_trimmed_unpaired.fastq -phred33 LEADING:25 TRAILING:25 MINLEN:35
Multiple cores found: Using 2 threads
Input Read Pairs: 50121 Both Surviving: 45162 (90.11%) Forward Only Surviving: 3074 (6.13%) Reverse Only Surviving: 1163 (2.32%) Dropped: 722 (1.44%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
 /Users/drew/Documents/Data/Python/Galaxy_rnaseq/Galaxy4-[brain_1.fastq].fastqsanger /Users/drew/Documents/Data/Python/Galaxy_rnaseq/Galaxy5-[brain_2.fastq].fastqsanger Galaxy4-[brain_1]_trimmed_paired.fastq Galaxy4-[brain_1]_trimmed_unpaired.fastq Galaxy5-[brain_2]_trimmed_paired.fastq Galaxy5-[brain_2]_trimmed_unpaired.fastq -phred33 LEADING:25 TRAILING:25 MINLEN:35
Multiple cores found: Using 2 threads
Input Read Pairs: 37992 Both Surviving: 29529 (77.72%) Forward Only Surviving: 2097 (5.52%) Reverse Only Surviving: 5045 (13.28%) Dropped: 1321 (3.48%)
TrimmomaticPE: Completed successfully

When I try copy-paste-running the above on the PC, however, I get the error:

Error: Invalid or corrupt jarfile /usr/bin/TrimmomaticPE

After Googling, I came across a post from my old collaborator Tony Bolger, saying that I need to add something about a partial URL of the source from which the program was obtained; something that I seem to remember seeing in my notes while I was at Harvard:

java -classpath <path to trimmomatic jar> org.usadellab.trimmomatic.TrimmomaticPE

Unfortunately, it seems that the executable at '/usr/bin/TrimmomaticPE' isn't the right file to pass in place of the '<path to trimmomatic jar>' placeholder in the command above. Since I used 'apt-get install trimmomatic' to get the program on my PC, I didn't know the actual location where that jar file could be found. So I went to root and used find:

cd /
sudo find . -name "*trimmomatic*jar"

That took a while to run because it scanned through all of the Windows-specific dirs, outputting 'permission denied' for tons of dirs, because the Ubuntu-specific password doesn't work for the Windows-controlled sectors of the hard drive. Anyways, it eventually found the jar file in:

./usr/share/java/trimmomatic-0.35.jar

But when I tried running that, it wanted the commands in a sufficiently different format that it would've been a pain in the ass to reprogram the whole function. Instead, I just chose to uninstall the distro I got with 'apt-get install', and grab the zip file from the same source as on the Mac. I put the zip file under the home dir in the Ubuntu structure; the jar will be at '~/Documents/BioPython/Trimmomatic-0.36/trimmomatic-0.36.jar'. Now the command formatting should be more compatible.

...Aaaand, the hell with that. Running this through the Ubuntu interface requires reformatting all of the 'path' values in the dict, from Windows-friendly to linux-friendly. I could do that with a couple of str.replace() calls, but realized that Trimmomatic should work just fine under native windows, too. Let's try that, instead. I grabbed the zip file from the same source as on the Mac. I put the zip file in my Documents structure; the jar will be at 'C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar'.

Ok. After a lot of fussing, I finally got the function to format the output correctly to function on the PC. Of course, when I go back to the Mac, I'll have to make sure that I didn't screw anything up too badly and it still functions there.

Not surprisingly, now that I reconsider the code, when I ran on the Mac, the output files went to /Users/drew/Documents/Data/Bio/FastQC, since that was the working dir of the terminal when I ran the commands. I had the function above change the working dir to that in which the input fastq files were found, but since I had written the function to just write out commands rather than executing them, changing the dir within python had no effect.

A better idea would be to run the function within the Python script, using the subprocess module. Anyways, for now, I'll point the fastq parser and plotting functions at those files and visualize the output.

...That returns:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

presumably because the 'np.vstack().T' calls within the 'biopy_fastq' function expect arrays all of the same size. Since Trimmomatic and other trimming tools cut out sequence content on the basis of individual base-call quality scores, the reads are now a distribution of many different sizes. I'll have to modify the function to pad the reads with 'NaN' values to all be the same length. These should be added just to the 3' end of each read, only used for manipulation within numpy, and should be removed again if I want to output the reads for any further processing outside of Python.

It sounds like 'np.pad()' is the way to do this, but I'm having a bit of trouble getting it to do exactly what I want: namely, pad variable-length rows to a common length, pad with the value 'np.nan', and only add to the right-hand side. This post might have some insight to offer.


Note: in the course of trying to alter the parse_fastq() function to address this issue, I commented out lines using the np.vstack() method and the any subsequent references to 'per_base' metrics, then ran the function. Surprisingly, on the PC that returns the aforementioned IOPub data rate error seen on the Mac. I'll try to address this using the same modification of the Jupyter config file as I did then.

When I try:

jupyter notebook --generate-config<br /><br />

on the PC, however, Windows opens a window asking what program I want to use to open this file. In other words, the file already existed. I chose Notepad++ to handle it, and the window indicates that the file opened is:

C:\Python\Lib\site-packages\jupyter.py<br /><br />

Its contents prior to modification are minimal:

"""Launch the root jupyter command"""<br />
if __name__ == '__main__':<br />
    from jupyter_core.command import main<br />
    main()<br /><br />

I saved the unmodified config file as:

C:\Python\Lib\site-packages\jupyter_bkup.py<br /><br />

I added the line:

c.NotebookApp.iopub_data_rate_limit = 10000000<br /><br />

at line 6, saved, and will reload the notebook server. The notebook reloaded fine, but still returned the iopub data rate warning. It seems that the two most likely explanations are that the alteration didn't take effect or that the increase in the data rate was insufficient. To test the first possibility, I'll reload the notebook using the argument from the command line, as the warning suggests, rather than relying upon the config file modification:

jupyter-notebook --NotebookApp.iopub_data_rate_limit=10000000<br /><br />

That does cause the cell to complete running without returning an error/warning, so it appears that the modification to the config file didn't have the desired effect. In order to ensure that, however, I reloaded the notebook server without specifying the command-line arg expansion of the data rate limit, and this time it does evaluate the cell without protest, so perhaps the first time I tried reloading after editing the config file was an anomaly, not the rule. For now, I will leave this issue, but should try to remember this episode if I encounter this warning again.</font>


In [29]:
from Bio import SeqIO
import pandas as pd
import os
import numpy as np
from collections import defaultdict

start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'

def parse_fastq(start_dir):
    biopy_fastqs = {}
    phred_scale = 39   # Assuming Illumina data; change if diff qual score scale
    for file in os.listdir(start_dir):
        if file.split(sep='.')[-1] in ['fastqsanger', 'fastq']:
            name = file.split(sep='.')[0]
            file_path = os.path.join(start_dir, file)
            biopy_fastqs[name] = {}
            biopy_fastqs[name].update({'path': os.path.join(start_dir, file),
                                       'seqs': []})
            for record in SeqIO.parse(file_path, "fastq"):
                biopy_fastqs[name]['seqs'].append(
                    {'id': record.id, 
                     'sequence': np.array(list(str(record.seq))),
                     'seq_whole': str(record.seq),
                     'quality': np.array(record.letter_annotations['phred_quality'])})
               
                
            biopy_fastqs[name]['df'] = pd.DataFrame.from_dict(biopy_fastqs[name]['seqs'])
#             biopy_fastqs[name]['per_base_qual'] = np.vstack(biopy_fastqs[name]['df']['quality']).T
#             biopy_fastqs[name]['per_base_seq'] = np.vstack(biopy_fastqs[name]['df']['sequence']).T
            
            # Count base calls to dicts; one for raw counts, one for percentages:
            biopy_fastqs[name]['per_base_counts'] = []
            biopy_fastqs[name]['per_base_percents'] = []
            
#             for i, row in enumerate(biopy_fastqs[name]['per_base_seq']):
#                 biopy_fastqs[name]['per_base_counts'].append(defaultdict())
#                 unique, counts = np.unique(row, return_counts=True)
#                 biopy_fastqs[name]['per_base_counts'][i] = dict(zip(unique, counts))
                
#             for i, dictionary in enumerate(biopy_fastqs[name]['per_base_counts']):
#                 biopy_fastqs[name]['per_base_percents'].append(defaultdict())
#                 for k in dictionary.keys():
#                     biopy_fastqs[name]['per_base_percents'][i][k] = dictionary[k] / sum(dictionary.values())
                        
            # Let's put the counts into a 'per_base_df':
#             biopy_fastqs[name]['per_base_seq_df'] = pd.DataFrame.from_dict(
#                 biopy_fastqs[name]['per_base_counts']).fillna(value=0)
#             biopy_fastqs[name]['per_base_percents_df'] = pd.DataFrame.from_dict(
#                 biopy_fastqs[name]['per_base_percents']).fillna(value=0)
#             biopy_fastqs[name]['per_base_qual_df'] = pd.DataFrame(
#                 biopy_fastqs[name]['per_base_qual']).fillna(value=0)
            
            # add values for duplication level
            biopy_fastqs[name]['duplicates'] = biopy_fastqs[name]['df'].groupby(['seq_whole']).size()\
            .value_counts().sort_index()

            # add mean per_seq_qual
            biopy_fastqs[name]['mean_seq_qual'] = pd.cut(biopy_fastqs[name]['df']['quality'] \
                                                       .apply(np.mean), range(1, 41)).value_counts().sort_index()
            biopy_fastqs[name]['mean_seq_qual'].index = np.arange(2, phred_scale + 2)

            
    
    return biopy_fastqs
In [30]:
# biopy_fastqs = parse_fastq(start_dir_Mac)
biopy_fastqs = parse_fastq(start_dir_PC)

for x in biopy_fastqs:
    print(x)
Galaxy2-[adrenal_1
Galaxy3-[adrenal_2
Galaxy4-[brain_1
Galaxy5-[brain_2
In [31]:
import matplotlib.pyplot as plt

for x in biopy_fastqs['Galaxy2-[adrenal_1']: print(x)
    
biopy_fastqs['Galaxy2-[adrenal_1']['df']['seq_whole'].str.len().plot(kind='hist', bins=20)
plt.show()
path
seqs
df
per_base_counts
per_base_percents
duplicates
mean_seq_qual

Ok, so now I've gotten numpy/pandas to load data with different-length reads. The 'np.vstack()' method doesn't work with these, but I can try to play around with the np.pad() method.

In [32]:
blah = biopy_fastqs['Galaxy2-[adrenal_1']['df']['seq_whole'].apply(np.vstack)

print('{}\n{}'.format(blah.shape, blah.T.shape))
(50121,)
(50121,)

Ok, so np.vstack() actually works on the 'seq_whole' data, but not on quality scores or 'sequence' itself. Presumably this is because every entry in the 'seq_whole' series is an array with the same shape as every other: they have 1 element, a string, although the strings themselves have different lengths. By comparison, the 'sequence' and 'quality' series have arrays of different lengths.

So this would appear to be exactly where np.pad() could help.

...But I can't get it to pad all the elements to a common width. Instead, it seems that the pad function is only used to add a set number of elements to every element over which it iterates. Instead, I'll have to use a for loop to achieve what I want.

For some reason, whenever I call on pandas to describe the series of 'quality' data for galaxy2_trimmed_paired, it causes the notebook to hang up. Is this really beyond the memory capacity of the notebook, or my PC?

Calling the describe() method on up to 1000 rows works, although noticeably slower than for 100 rows. 10,000 rows doesn't seem like it will complete.

On the Mac, describing 100 rows takes about 161ms; 1,000 takes 12.4 seconds. Obviously this dataset is difficult for either laptop to work with.


NOTE: Since the parse_fastq() function below has been modified to interpret trimmed reads, and since trimmed reads aren't compatible with the plot_fastq() function I'd written, I had to 'pad' the sequence and quality scores, and store those as separate dict and Pandas DataFrame values.

The way that I have padded them is dependent upon the assumption that these are only meant for the interpretation of per-base and per-sequence stats within notebooks like this one, and that they won't be carried forward for writing new fastq files for use with other software in the analysis pipeline, which won't accept exotic values.

I padded the quality scores with 'np.nan', so that they wouldn't interfere with calculating stats about whole reads this also avoids any ambiguity about whether they came from the original quality scores assigned by the sequencer, or from the artificial process of padding them with this function for the purpose of internal analysis. But since the sequence values of 'N' are valid outputs from the sequencer, to avoid any ambiguity about whether a base call comes from that source or was introduced by padding here, I've chosen to alter the parse_fastq() function to pad with the base letter of 'Z'.

IF the output of any of these dicts will be carried forward using the padded values introduced here, either the parse_fastq() function should be modified again, perhaps to introduce '0' and 'N' values for 'np.nan' and 'Z', respectively, or else a downstream function should convert these values prior to writing to a new fastq file. Since I've chosen to keep the 'quality' and 'sequence' values separate in the dict from their respective 'quality_pad' and 'sequence_pad' values, however, I'm hoping this will prove unnecessary.</font>

In [33]:
from Bio import SeqIO
import pandas as pd
import os
import numpy as np
from collections import defaultdict

start_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq'
start_dir_Mac = '/Users/drew/Documents/Data/Python/Galaxy_rnaseq/'


def parse_fastq(start_dir, read_length=50):
    biopy_fastqs = {}
    phred_scale = 39   # Assuming Illumina data; change if diff qual score scale
    for file in os.listdir(start_dir):
        if file.split(sep='.')[-1] in ['fastqsanger', 'fastq']:
            name = file.split(sep='.')[0]
            file_path = os.path.join(start_dir, file)
            biopy_fastqs[name] = {}
            biopy_fastqs[name].update({'path': os.path.join(start_dir, file),
                                       'directory': start_dir,
                                       'seqs': []})
            for record in SeqIO.parse(file_path, "fastq"):
                quality = np.array(record.letter_annotations['phred_quality'])
                quality_pad = np.append(quality, [np.nan]*(read_length-len(quality)))
                sequence = np.array(list(str(record.seq)))
                sequence_pad = np.append(sequence, ['Z']*(read_length-len(sequence)))
                biopy_fastqs[name]['seqs'].append({'id': record.id, 
                                                   'sequence': sequence,
                                                   'sequence_pad': sequence_pad,
                                                   'seq_whole': str(record.seq),
                                                   'quality': quality,
                                                   'quality_pad': quality_pad,
                                                   'description' : record.description,
                                                   'name': record.name})
               

            biopy_fastqs[name]['df'] = pd.DataFrame.from_dict(biopy_fastqs[name]['seqs'])
            
            # Can use the np.vstack() method on quality scores and sequences
            # Once we've brought all axes up to the same, standard length
            biopy_fastqs[name]['per_base_qual'] = np.vstack(biopy_fastqs[name]['df']['quality_pad']).T
            biopy_fastqs[name]['per_base_seq'] = np.vstack(biopy_fastqs[name]['df']['sequence_pad']).T
            
            # Count base calls to dicts; one for raw counts, one for percentages:
            biopy_fastqs[name]['per_base_counts'] = []
            biopy_fastqs[name]['per_base_percents'] = []
            
            for i, row in enumerate(biopy_fastqs[name]['per_base_seq']):
                biopy_fastqs[name]['per_base_counts'].append(defaultdict())
                unique, counts = np.unique(row, return_counts=True)
                biopy_fastqs[name]['per_base_counts'][i] = dict(zip(unique, counts))
                
            for i, dictionary in enumerate(biopy_fastqs[name]['per_base_counts']):
                biopy_fastqs[name]['per_base_percents'].append(defaultdict())
                for k in dictionary.keys():
                    biopy_fastqs[name]['per_base_percents'][i][k] = dictionary[k] / sum(dictionary.values())
                        
            # Let's put the counts into a 'per_base_df':
            biopy_fastqs[name]['per_base_seq_df'] = pd.DataFrame.from_dict(
                biopy_fastqs[name]['per_base_counts'])#.fillna(value=0)
            biopy_fastqs[name]['per_base_percents_df'] = pd.DataFrame.from_dict(
                biopy_fastqs[name]['per_base_percents'])#.fillna(value=0)
            biopy_fastqs[name]['per_base_qual_df'] = pd.DataFrame(
                biopy_fastqs[name]['per_base_qual'])#.fillna(value=0)
            
            # add values for duplication level
            biopy_fastqs[name]['duplicates'] = biopy_fastqs[name]['df'].groupby(['seq_whole']).size()\
            .value_counts().sort_index()

            # add mean per_seq_qual
            biopy_fastqs[name]['mean_seq_qual'] = pd.cut(biopy_fastqs[name]['df']['quality'] \
                                                       .apply(np.mean), range(1, 41)).value_counts().sort_index()
            biopy_fastqs[name]['mean_seq_qual'].index = np.arange(2, phred_scale + 2)
            
            print('Added fastq file {0}'.format(name))

    print('')        
    print('Finished parsing directory.')
    return biopy_fastqs
In [34]:
biopy_fastqs = parse_fastq(start_dir_PC)
# biopy_fastqs = parse_fastq(start_dir_Mac)
Added fastq file Galaxy2-[adrenal_1
Added fastq file Galaxy3-[adrenal_2
Added fastq file Galaxy4-[brain_1
Added fastq file Galaxy5-[brain_2

Finished parsing directory.

Note: Up to this point in this notebook, I've only read in FastQ files and used BioPython to summarize their quality and modify their data without writing those data out to a file, as well as controlling Trimmomatic, which handles writing FastQ files itself. When I try to write out FastQ files below (after removing reads whose mean quality scores were below a threshold), I found out that BioPython's SeqIO module is built around the assumption that any in-Python manipulations will be relatively short blocks of code, and writing out would be relatively simple because the record (i.e., sequencing read)'s data will have remained associated with all of its relevant parts via the SeqIO module.

Since I've adopted this whole approach of using a parse_fastq() function, I've created a complicated dict that loses that association (but gained the advantage that all of the relevant data associated with the sequence object have been broken out into more familiar Python types). To write back to FastQ files, I can think of three main options, each with specific pros & cons:

  • Add the SeqIO record as a specific object associated with each 'seq' within each 'dict' output by the parse_fastq() function. This is the simplest to implement, but I'm not sure if it will greatly inflate the memory usage associated with the dict, and the respective time it would take to process the data in downstream operations. I'm thinking it might be significant, since it'd be essentially duplicating much of the other info in the dict.

  • Use any data manipulation from the parse_fastq()-output dict to just generate a list of sequence IDs, then refer back to the original input FastQ file to grab those IDs and write them out to a new file. That's less simple than the first option to implement, but would probably be more memory-efficient. A problem is that it would lose any potential modifications I make to any particular sequence's data within Python. That hasn't really come up yet; I've trimmed sequences with Trimmomatic, and padded them back out just to be able to more easily get summary data, but if I ever wanted to save any trim/pad info in the future, I'd have to find another workaround.

  • Ignore the intricacies of using the SeqIO module to write out, and just make new files with a custom format along the lines indicated at the start of this notebook: FastQ files are inherently ['seq1_id', '+', 'seq1_qual'] repeated over however many lines needed. I can certainly force that with the dicts I have (although some of the ID info was truncated, including the trailing '/1' / '/2' characters indicating which file is which read. Alternatively, as a modification of this approach, I could perhaps use the SeqIO module by setting the dict-filtered data for each read to create new SeqIO 'record' objects, then write these out to files, in case any intricacies of formatting that I haven't noticed are important.

I'll try the last approach first; even though it's the most involved, it is the most flexible.

In [35]:
import Bio.SeqIO
import datetime
import errno
import os


def make_subdirs(fastq_dict, out_dir_suffix=None):
    # A function to create new subdirs, if needed, 
    # using datetime if no particular name is specified:
    name_dir = {}
    
    if out_dir_suffix != None:
        suffix = str(out_dir_suffix)
    elif out_dir_suffix == None:
        suffix = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    
    # Get all of the directories 
    # associated with files in the dict
    dirs = set()
    for fastq_name in fastq_dict.keys():
        dir_temp = fastq_dict[fastq_name]['directory']
        dirs.add(dir_temp)
        name_dir[fastq_name] = dir_temp
        
    # Make name for new subdirs
    for directory in dirs:
        new_dir = os.path.join(directory, suffix)
        # Associate FastQ files with new subdir name
        for fastq_name in name_dir:
            if name_dir[fastq_name] == directory:
                name_dir[fastq_name] = new_dir
        # Write the directories
        try:
            os.makedirs(new_dir)
            print('Wrote {}'.format(new_dir))
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
    
    return name_dir


def write_fastq(fastq_dict, out_dir=None, out_dir_suffix=None):
    # Initialize a dict to store assoc between
    # FastQ files and the dirs they go to
    name_dir = {}

    # If user-specified outdir given
    # (i.e., one dir to put all files in):
    if out_dir != None:
        for fastq_name in fastq_dict:
            name_dir[fastq_name] = out_dir
            
        # If the dir doesn't already exist, make it
        if os.path.isdir(out_dir) == False:
            try:
                os.makedirs(out_dir)
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise
            
    # If out_dir_suffix given (i.e., fastq_dict contains
    # files in various dirs, and user wants a specific name
    # associated with each subdir):
    elif out_dir == None:
        name_dir = make_subdirs(fastq_dict,
                                out_dir_suffix=out_dir_suffix)

    # Or: I'm not using args correctly
    else:
        print('Interpretation error in arg "out_dir" or "out_dir_suffix".')
    
    # Now, iterate through the fastq files in the dict
    # write them to new SeqRecord objects, and write to files
    for fastq_name in sorted(fastq_dict):
        records = []
        df = fastq_dict[fastq_name]['df']
        for index, row in df.iterrows():
            record = SeqIO.SeqRecord(seq=row['seq_whole'],
                                     id=row['description'],
                                     letter_annotations={'phred_quality': row['quality']},
                                     description='')
            records.append(record)

        filename = os.path.join(name_dir[fastq_name],
                                fastq_name+'.fastq')
        with open(filename, 'w+') as f:
            SeqIO.write(records, handle=f, format='fastq')
        
        print('')    
        print('{0} written to {1}'.format(fastq_name, filename))
        
    print('')
    print('Finished writing dict to fastq.')
In [36]:
dummy_dir_PC = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\dummy'

# write_fastq(biopy_fastqs, out_dir=dummy_dir_PC)

# write_fastq(biopy_fastqs, out_dir_suffix='blah')

# write_fastq(biopy_fastqs)

Ok, after a LOT of fiddling, I got that to work the way I wanted. At first I had tried referring to the dict entries for each seq to populate the output files, but indexing nested dicts is tricky, and finally I found it easier to grab the entries from the Pandas DataFrame instead.

I noticed that in the output fastq files, the formatting looked pretty OK, except that the first entry was like this:

@ERR030881.107 <unknown description>
CGGATTTCAGCTACTGCAAGCTCAGTACCACAGCCTCAAGCTCGAATGTG
+
HH;HHHHHGHHHHHHHHHHGHDHEHHHHHEHHHHBHHFHHHHHHHHHD0F

Whereas in the original Galaxy3-[adrenal_2.fastq].fastqsanger file it was:

@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/2
CGGATTTCAGCTACTGCAAGCTCAGTACCACAGCCTCAAGCTCGAATGTG
+
HH;HHHHHGHHHHHHHHHHGHDHEHHHHHEHHHHBHHFHHHHHHHHHD0F

So it seems that the relevant parameter in which to save that additional info is the 'description' attribute for the SeqIO.SeqRecord object. Therefore I should be able to write that into the original parse_fastq() function and keep that info for each read (the info about the Illumina machine on which it was read is nonessential, but the info that keeps it paired with its opposite in another file is pretty relevant). I'll try making that change to the last instance of the parse_fastq() function above.

...That pretty much works, except that apparently BioPython's behavior with FastQ files is to parse the first line for each read with the stuff before the space being the 'record.id' feature, and the ENTIRE line as the 'record.description' feature. There's an additional feature it can grab from this row, called 'record.name', but that apparently ends up as the same as 'record.id'. So if I ask the fastq_write() function to write 'name' AND 'description', it'd end up just repeating the id in the same line. Instead, I'll just set the function to write the description as the name.

That worked, but the danged SeqIO still tags '<unknown description>' onto the end:

@ERR030881.107 HWI-BRUNOP16X_0001:2:1:13663:1096#0/1 <unknown description>
ATCTTTTGTGGCTACAGTAAGTTCAATCTGAAGTCAAAACCAACCAATTT
+
5.544,444344555CC?CAEF@EEFFFFFFFFFFFFFFFFFEFFFEFFF

I'll modify the 'write_fastq()' function again to explicitly declare that the record.description should be None, or an empty string or something. Ok, None didn't work, but an empty string did. It didn't add any trailing whitespace to the line, and since a fastq file has limited formatting nuance, I would expect subsequent programs to treat it as legit.

Ok, I'm sure this could be improved upon, but that's not bad. I've finally hit the goal of automating the gathering and parsing of all fastq files in a dir, and plotting their summary characteristics in a manner that scales with the number of files. Actually, I'm noticing a flaw with the duplication level of the last file; the 'trimmed_unpaired' (last plot) doesn't have nearly as many reads, but the plotter still stretches the y axis to occupy the full space, even though I tried to specify that it should share the same y axis throughout the row. Eh, anyways, for now I'll skip running that down, and get down to optimizing trim parameters (I should probably add a function that summarizes read counts after trimming, and maybe build this into the trimmomatic caller), and run some small-scale alignments before submitting a big job to use the full dataset.


Actually, I just realized that I never applied a mean sequence quality cutoff prior to using Trimmomatic. I'll write a short function now to do that for all fastq files in a dict of parsed fastq files, so that you can get rid of any particularly low-quality reads, either before or after trimming based on per-base call quality scores.

Note: In order to get proper formatting for the output info about how many reads were dropped, I had to make extensive use of the 'print(''.format)' functionality in Python.

Also, in the course of making the write_fastq() function, I ended up deleting the trimmed fastq files in the source dir, so I'll regenerate those now using the trimmomatic_paired() function.

ACTUALLY, I just noticed from the plots above that the wonky ACGT percentages for the first 10 bases of each file are far too similar. Check out, specifically, how 'A' is elevated for positions ~2-5 and depressed for positions ~6-8 for all libraries, even after trimming. This suggests strongly some adapter sequences that didn't get trimmed. So I'll try adding parameters to the trimmomatic function and the corresponding call to look for these. The Trimmomatic docs state that 'TruSeq3' is the appropriate choice of FASTA input for all HiSeq and MiSeq runs, and the docs for the source reads states they were HiSeq.

But that doesn't find the adapters fasta file. I'll have to add more code to provide for the path to those files. These did indeed come with the install zip; on the PC they're at:

C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters

...I tried adding that full path into the commands, but the problem is that the Trimmomatic tool interprets colons as separating important inputs within the ILLUMINACLIP command, so it ends up taking 'C:' as the fasta file with the adapter sequences. As a workaround, I'll have the program execute within the dir containing the adapters sequences.

In [37]:
import sys
import os
import subprocess


def trimmomatic_paired(fastq_dict, mac_pc='mac', subdir=True, 
                       subdir_suffix=None, verbose=False, **kwargs):
    
    # Set the path for trimmomatic based on which laptop I'm using
    rand_entry = list(fastq_dict.keys())[0]
    if mac_pc == 'mac':
        trim_path = '/Users/drew/Documents/Data/Bio/\
Trimmomatic-0.36/trimmomatic-0.36.jar'
        commands = []
        sep = ';'
    elif mac_pc == 'pc':
        trim_path = r'C:\Users\DMacKellar\Documents\Python\\
BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar'
        commands = []
        sep = ' & '
    else:
        print('Please specify "mac_pc" arg as "mac" or "pc"')
    
    adapter_path = os.path.join(os.path.dirname(trim_path), 'adapters')
    os.chdir(adapter_path)
    
    # Confirm that the trimmomatic jar file exists
    try:
        os.path.isfile(trim_path)
    except:
        print('trimmomatic.jar file not found at expected location')
        sys.exit(1)   # abort the function
    
    # Confirm that the input dict 
    # contains an even number of fastq files
    try:
        len(fastq_dict) % 2 != 0
    except:
        print('Input has odd number of fastq files; \
\nPlease enter paired end data.')
        sys.exit(1)   # abort the function
        
    use_kwargs = {}
    
    # Specify valid kwarg keys and default values
    default_kwargs = {
        'phred': '-phred33',
        'ILLUMINACLIP': '',
        'LEADING': 'LEADING:25',
        'TRAILING': 'TRAILING:25',
        'SLIDINGWINDOW': '',
        'MINLEN': 'MINLEN:30'
    }
    
    # For any input kwargs, pass them to the use_kwargs dict
    for key, value in kwargs.items():
        if key in default_kwargs:
            use_kwargs[key] = value
            # If adapter trimming requested,
            # Confirm location of adapter fasta file
            if (key=='ILLUMINACLIP') & (value != ''):
                illclip_split = value.split(sep=':')
                adapter_file = os.path.join(adapter_path,
                                            illclip_split[1])
                try:
                    os.path.isfile(adapter_file)
                except OSError as err:
                    print('Adapter file {0} not found.'.format(adapter_file))
                    sys.exit(1)   # abort the function

        else:
            print('kwarg key, value pair: {0}: {1} not understood;\
please specify key in a case-sensitive manner.\
Default kwargs are: {3}'.format(key,
                                value,
                                **default_kwargs))
    
    # For any kwargs not fed to input, use defaults
    for key, value in default_kwargs.items():
        if key not in use_kwargs:
            use_kwargs[key] = value
            
    print('kwarg key, value pairs used: ', use_kwargs)
    
    name_dir = {}
    if subdir == True:
        if subdir_suffix != None:
            # Use user-defined subdir name
            name_dir = make_subdirs(fastq_dict,
                                out_dir_suffix=str(subdir_suffix))
        elif subdir_suffix == None:
            # Make new subdir to put files in
            name_dir = make_subdirs(fastq_dict,
                                    out_dir_suffix='trim_L{0}_T{1}'
                                    .format(use_kwargs['LEADING'].split(sep=':')[1],
                                            use_kwargs['TRAILING'].split(sep=':')[1]))
    elif subdir == False:
        for fastq_name in fastq_dict:
            name_dir[fastq_name] = fastq_dict[fastq_name]['directory']
        
    # Format and output the relevant commands
    fastq_paths = [fastq_dict[x]['path'] for x in sorted(fastq_dict)]
    output_paths = [os.path.join(name_dir[x], x) for x in sorted(fastq_dict)]
    forward_paths = fastq_paths[0::2]
    reverse_paths = fastq_paths[1::2]
    out_f_paths = output_paths[0::2]
    out_r_paths = output_paths[1::2]
    
    for f_p, r_p, f_outp, r_outp in zip(forward_paths, reverse_paths,
                                        out_f_paths, out_r_paths):
        trim_commands = ['java -jar', trim_path, 'PE', use_kwargs['phred'],
                        use_kwargs['ILLUMINACLIP'], use_kwargs['LEADING'], 
                        use_kwargs['TRAILING'], use_kwargs['SLIDINGWINDOW'], 
                        use_kwargs['MINLEN']]

        to_insert = [f_p, r_p, '%s_trimmed_paired.fastq' % f_outp,
                     '%s_trimmed_unpaired.fastq' % f_outp,
                     '%s_trimmed_paired.fastq' % r_outp,
                     '%s_trimmed_unpaired.fastq' % r_outp]
        for i, thing in enumerate(to_insert):
            trim_commands.insert(4+i, thing)

        trim_commands = ' '.join(trim_commands)
        commands.append(trim_commands)
        
    commands = sep.join(commands)
    stdout = subprocess.run(commands, stdout=subprocess.PIPE, 
                   stderr=subprocess.STDOUT, shell=True,
                   check=True)
    if verbose==True:
        for x in stdout.stdout.splitlines():
            print(x.decode('utf-8'))
#     return commands
    return stdout
In [38]:
trim_kwargs = {'ILLUMINACLIP': 'ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10'}

stdout = trimmomatic_paired(biopy_fastqs,
                            mac_pc='pc',
                            verbose=True,
                            subdir=True,
                            **trim_kwargs)
kwarg key, value pairs used:  {'ILLUMINACLIP': 'ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10', 'phred': '-phred33', 'LEADING': 'LEADING:25', 'TRAILING': 'TRAILING:25', 'SLIDINGWINDOW': '', 'MINLEN': 'MINLEN:30'}
TrimmomaticPE: Started with arguments:
 -phred33 C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy2-[adrenal_1.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy3-[adrenal_2.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\Galaxy2-[adrenal_1_trimmed_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\Galaxy2-[adrenal_1_trimmed_unpaired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\Galaxy3-[adrenal_2_trimmed_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\Galaxy3-[adrenal_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30
Multiple cores found: Using 4 threads
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 50121 Both Surviving: 46402 (92.58%) Forward Only Surviving: 2443 (4.87%) Reverse Only Surviving: 788 (1.57%) Dropped: 488 (0.97%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
 -phred33 C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy4-[brain_1.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy5-[brain_2.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\Galaxy4-[brain_1_trimmed_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\Galaxy4-[brain_1_trimmed_unpaired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\Galaxy5-[brain_2_trimmed_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\Galaxy5-[brain_2_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30
Multiple cores found: Using 4 threads
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 37992 Both Surviving: 30847 (81.19%) Forward Only Surviving: 1750 (4.61%) Reverse Only Surviving: 4465 (11.75%) Dropped: 930 (2.45%)
TrimmomaticPE: Completed successfully
In [39]:
trim_dir = r'C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25'
trim1_dict = parse_fastq(trim_dir)
Added fastq file Galaxy2-[adrenal_1_trimmed_paired
Added fastq file Galaxy2-[adrenal_1_trimmed_unpaired
Added fastq file Galaxy3-[adrenal_2_trimmed_paired
Added fastq file Galaxy3-[adrenal_2_trimmed_unpaired
Added fastq file Galaxy4-[brain_1_trimmed_paired
Added fastq file Galaxy4-[brain_1_trimmed_unpaired
Added fastq file Galaxy5-[brain_2_trimmed_paired
Added fastq file Galaxy5-[brain_2_trimmed_unpaired

Finished parsing directory.

Hmm... Now Trimmomatic is returning flawed sequences. I've re-run the trim module four times, and whenever I try to parse the output, BioPython returns an error; thrice it was about differing lengths for the respective sequence and quality scores for some read. The read in question varies every time, and checking them in the fastq file indeed reveals that BioPython is correct. One time the parser returned 'whitespace is not allowed in the sequence'. I really don't know why I'm seeing this; I didn't have this problem the first time I wrote the 'trimmomatic_paired' function.

I'll try running manually from the command line to see whether there's some wrinkle being introduced via running from the notebook:

# (on PC):
cd C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters
java -jar C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar PE -phred33 C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy2-[adrenal_1.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy3-[adrenal_2.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\output_forward_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\output_forward_unpaired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\output_reverse_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\trim_L25_T25\output_reverse_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30

Ok, yes: that does execute just fine when input manually to the command line. BUT if I use the function to prepare the commands as a string, then copy and paste that into the command line, the output also fails to parse. So there's something specific about the difference in the code prepared by the function, not the effect of just running it from within the notebook.

The commands aren't formatted EXACTLY the same way, however; the manual one I did above has slightly different names given to the output fastq files, and each path contains double-backslashes. I'll try re-formatting the input command for a manual submission to more closely match the function's output, to try to narrow down possible sources of this difference:

# (on PC):
cd C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters
java -jar C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\trimmomatic-0.36.jar PE -phred33  C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\Galaxy2-[adrenal_1.fastq].fastqsanger C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\Galaxy3-[adrenal_2.fastq].fastqsanger C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\trim_L25_T25\\Galaxy2-[adrenal_1_trimmed_paired.fastq C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\trim_L25_T25\\Galaxy2-[adrenal_1_trimmed_unpaired.fastq C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\trim_L25_T25\\Galaxy2-[adrenal_1_trimmed_paired.fastq C:\\Users\\DMacKellar\\Documents\\Python\\BioPython\\Galaxy_rnaseq\\trim_L25_T25\\Galaxy2-[adrenal_1_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30

Aha, now THAT generates an error when parsing the output. There's something about these filename paths that's somehow causing Trimmomatic to behave erratically.

# (on PC):
cd C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters
java -jar C:/Users/DMacKellar/Documents/Python/BioPython/Trimmomatic-0.36/trimmomatic-0.36.jar PE -phred33  C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy2-[adrenal_1.fastq].fastqsanger C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy3-[adrenal_2.fastq].fastqsanger C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/Galaxy2-[adrenal_1_trimmed_paired.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/Galaxy2-[adrenal_1_trimmed_unpaired.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/Galaxy2-[adrenal_1_trimmed_paired.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/Galaxy2-[adrenal_1_trimmed_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30

Well, getting rid of all the backslashes doesn't help; it's still an error upon parsing. I'll try just changing the filenames.

# (on PC):
cd C:\Users\DMacKellar\Documents\Python\BioPython\Trimmomatic-0.36\adapters
java -jar C:/Users/DMacKellar/Documents/Python/BioPython/Trimmomatic-0.36/trimmomatic-0.36.jar PE -phred33  C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy2-[adrenal_1.fastq].fastqsanger C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy3-[adrenal_2.fastq].fastqsanger C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/a.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/b.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/c.fastq C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/trim_L25_T25/d.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 MINLEN:30

That helps. But now I finally see the difference: it wasn't the path or filenames of the outputs above that mattered so much, it was the fact (that I didn't notice up until now) that running the function's output generates only two output files, both labeled 'Galaxy2...', instead of two each for Galaxy2 and Galaxy3. It seems that, when I modified the function above to generate output paths for the full dict, I didn't consider that it might have a different length than the input path lists.

Here's the code from the function (lines 95-117) that was causing the issue:

# Format and output the relevant commands
fastq_paths = [fastq_dict[x]['path'] for x in sorted(fastq_dict)]
output_paths = [os.path.join(name_dir[x], x) for x in sorted(fastq_dict)]
forward_paths = fastq_paths[0::2]
reverse_paths = fastq_paths[1::2]

for f_p, r_p, out in zip(forward_paths, reverse_paths, output_paths):
    trim_commands = ['java -jar', trim_path, 'PE', use_kwargs['phred'],
                    use_kwargs['ILLUMINACLIP'], use_kwargs['LEADING'], 
                    use_kwargs['TRAILING'], use_kwargs['SLIDINGWINDOW'], 
                    use_kwargs['MINLEN']]

    to_insert = [f_p, r_p, '%s_trimmed_paired.fastq' % out,
                 '%s_trimmed_unpaired.fastq' % out,
                 '%s_trimmed_paired.fastq' % out,
                 '%s_trimmed_unpaired.fastq' % out]
    for i, thing in enumerate(to_insert):
        trim_commands.insert(4+i, thing)

    trim_commands = ' '.join(trim_commands)
    commands.append(trim_commands)

commands = sep.join(commands)

When I add a print statement to get the length of each list being fed into the 'for f_p...' line, I get:

output_paths: 4 forward_paths:  2 reverse_paths:  2

So the function was writing 'Galaxy2..._trimmed_unpaired.fastq' out, then overwriting it with the output from trimming Galaxy3, and this was somehow screwing up the output in a way that made individual reads inconsistent. I'll fix that now, by subjecting the 'output_paths' variable to the same slicing that the 'fastq_paths' list gets.

Ok, that appears to have done the trick. Man, it's super annoying how long it can take to track down such a simple error when coding. Still, at least I was able to take the appropriate steps to keep things moving forward.

Anyways, the next step was going to be to filter the trimmed reads for average quality, but I'll skip that for now. This short post suggests that it's not very applicable in the preprocessing of sequence data.


BUT: The adapters are still there. Using the ILLUMINACLIP settings didn't change the per-base skew that's apparent when looking at the plots. I'm betting the Trimmomatic-supplied fasta files are out-of-date, or otherwise weren't the right ones for this dataset. After looking around online, I found a more comprehensive collection of contaminant sequences, and will try using that as an input file for trimmomatic. Unfortunately, the file's format isn't fasta, so I'll have to re-format it first.

In [40]:
import re

path1 = r'C:\Users\DMacKellar\Dropbox\Coding\\
Python\BioPython\idot_github_contaminant_list.txt'
path2 = r'C:\Users\DMacKellar\Dropbox\Coding\\
Python\BioPython\idot_github_contaminant_list.fasta'
path3 = r'C:\Users\DMacKellar\Documents\Python\\
BioPython\Trimmomatic-0.36\adapters\idot_github_contaminant_list.fasta'

with open(path1, 'r') as f:
    lines = f.readlines()
    pattern = re.compile(r'(\t)+')
    illumina = []
    
    for line in lines[17:160]:
        substitutes = {' ': '_'}#, pattern.match(line): '\n'}
        for key, value in substitutes.items():
            if line == '\n':
                continue
            newline = '>'+line.replace(key, value)
            newline = re.sub(pattern, '\n', newline)
        illumina.append(newline)
        
with open(path3, 'w') as f2:
    for line in illumina:
        f2.write(line)
In [41]:
illumina[:3]
Out[41]:
['>Illumina_Single_End_Adapter_1\nACACTCTTTCCCTACACGACGCTGTTCCATCT\n',
 '>Illumina_Single_End_Adapter_2\nCAAGCAGAAGACGGCATACGAGCTCTTCCGATCT\n',
 '>Illumina_Single_End_PCR_Primer_1\nAATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT\n']
In [42]:
trim_kwargs = {'ILLUMINACLIP': 
               'ILLUMINACLIP:idot_github_contaminant_list.fasta:2:30:7'}

stdout = trimmomatic_paired(biopy_fastqs,
                            mac_pc='pc',
                            verbose=True,
                            subdir=True,
                            subdir_suffix='contaminant_trim',
                            **trim_kwargs)
kwarg key, value pairs used:  {'ILLUMINACLIP': 'ILLUMINACLIP:idot_github_contaminant_list.fasta:2:30:7', 'phred': '-phred33', 'LEADING': 'LEADING:25', 'TRAILING': 'TRAILING:25', 'SLIDINGWINDOW': '', 'MINLEN': 'MINLEN:30'}
TrimmomaticPE: Started with arguments:
 -phred33 C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy2-[adrenal_1.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy3-[adrenal_2.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\contaminant_trim\Galaxy2-[adrenal_1_trimmed_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\contaminant_trim\Galaxy2-[adrenal_1_trimmed_unpaired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\contaminant_trim\Galaxy3-[adrenal_2_trimmed_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\contaminant_trim\Galaxy3-[adrenal_2_trimmed_unpaired.fastq ILLUMINACLIP:idot_github_contaminant_list.fasta:2:30:7 LEADING:25 TRAILING:25 MINLEN:30
Multiple cores found: Using 4 threads
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATCAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCTCATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAGGAATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Medium Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTTTTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'ACAGGTTCAGAGTTCTACAGTCCGAC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTAGTTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT'
Using Long Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTTCAGAGTTCTACAGTCCGACGATC'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CGACAGGTTCAGAGTTCTACAGTCCGACGATC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGGCCACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGAAACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGTACGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCCACTCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCTACCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGTTGGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATTCCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAGCTAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGTATAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Medium Clipping Sequence: 'GCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'ACAGGTTCAGAGTTCTACAGTCCGACATG'
Using Long Clipping Sequence: 'CCGACAGGTTCAGAGTTCTACAGTCCGACATG'
Using Medium Clipping Sequence: 'TCGGACTGTAGAACTCTGAAC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCCGGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATCGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGAGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGCCTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCCATGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAAAATGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Medium Clipping Sequence: 'TCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'GATCGTCGGACTGTAGAACTCTGAAC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTCTACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCGGACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTTTCACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Medium Clipping Sequence: 'GATCGGAAGAGCACACGTCT'
Skipping duplicate Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'ACAGGTTCAGAGTTCTACAGTCCGAC'
Skipping duplicate Clipping Sequence: 'GTTCAGAGTTCTACAGTCCGACGATC'
Using Medium Clipping Sequence: 'TCGTATGCCGTCTTCTGCTTGT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTTGACTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'CCGACAGGTTCAGAGTTCTACAGTCCGACATG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'CGACAGGTTCAGAGTTCTACAGTCCGACGATC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTC'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Skipping duplicate Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTC'
Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTTCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGCCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'ACAGGTTCAGAGTTCTACAGTCCGACATG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTC'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTC'
Skipping duplicate Clipping Sequence: 'TCGTATGCCGTCTTCTGCTTG'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTGTTCCATCT'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCTGAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGTCGTCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGATTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCTGTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATTATAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGAATGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCGGGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Skipping duplicate Clipping Sequence: 'TCGTATGCCGTCTTCTGCTTGT'
Skipping duplicate Clipping Sequence: 'CGACAGGTTCAGAGTTCTACAGTCCGACGATC'
ILLUMINACLIP: Using 0 prefix pairs, 96 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 50121 Both Surviving: 46333 (92.44%) Forward Only Surviving: 2454 (4.90%) Reverse Only Surviving: 832 (1.66%) Dropped: 502 (1.00%)
TrimmomaticPE: Completed successfully
TrimmomaticPE: Started with arguments:
 -phred33 C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy4-[brain_1.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\Galaxy5-[brain_2.fastq].fastqsanger C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\contaminant_trim\Galaxy4-[brain_1_trimmed_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\contaminant_trim\Galaxy4-[brain_1_trimmed_unpaired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\contaminant_trim\Galaxy5-[brain_2_trimmed_paired.fastq C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\contaminant_trim\Galaxy5-[brain_2_trimmed_unpaired.fastq ILLUMINACLIP:idot_github_contaminant_list.fasta:2:30:7 LEADING:25 TRAILING:25 MINLEN:30
Multiple cores found: Using 4 threads
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATCAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCTCATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAGGAATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Medium Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTTTTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'ACAGGTTCAGAGTTCTACAGTCCGAC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTAGTTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT'
Using Long Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTTCAGAGTTCTACAGTCCGACGATC'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CGACAGGTTCAGAGTTCTACAGTCCGACGATC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGGCCACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGAAACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGTACGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCCACTCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCTACCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGTTGGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATTCCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAGCTAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGTATAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Medium Clipping Sequence: 'GCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'ACAGGTTCAGAGTTCTACAGTCCGACATG'
Using Long Clipping Sequence: 'CCGACAGGTTCAGAGTTCTACAGTCCGACATG'
Using Medium Clipping Sequence: 'TCGGACTGTAGAACTCTGAAC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCCGGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATCGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGAGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGCCTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCCATGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAAAATGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Medium Clipping Sequence: 'TCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'GATCGTCGGACTGTAGAACTCTGAAC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTCTACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCGGACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTTTCACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Medium Clipping Sequence: 'GATCGGAAGAGCACACGTCT'
Skipping duplicate Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'ACAGGTTCAGAGTTCTACAGTCCGAC'
Skipping duplicate Clipping Sequence: 'GTTCAGAGTTCTACAGTCCGACGATC'
Using Medium Clipping Sequence: 'TCGTATGCCGTCTTCTGCTTGT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTTGACTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'CCGACAGGTTCAGAGTTCTACAGTCCGACATG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'CGACAGGTTCAGAGTTCTACAGTCCGACGATC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTC'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Skipping duplicate Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTC'
Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTTCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTGCCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'ACAGGTTCAGAGTTCTACAGTCCGACATG'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTC'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTC'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTC'
Skipping duplicate Clipping Sequence: 'TCGTATGCCGTCTTCTGCTTG'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTGTTCCATCT'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCTGAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGTCGTCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATCGATTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGCTGTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATATTATAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATGAATGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGATTCGGGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Skipping duplicate Clipping Sequence: 'CAAGCAGAAGACGGCATACGA'
Skipping duplicate Clipping Sequence: 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA'
Skipping duplicate Clipping Sequence: 'TCGTATGCCGTCTTCTGCTTGT'
Skipping duplicate Clipping Sequence: 'CGACAGGTTCAGAGTTCTACAGTCCGACGATC'
ILLUMINACLIP: Using 0 prefix pairs, 96 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 37992 Both Surviving: 30815 (81.11%) Forward Only Surviving: 1758 (4.63%) Reverse Only Surviving: 4473 (11.77%) Dropped: 946 (2.49%)
TrimmomaticPE: Completed successfully
In [43]:
trim2_dir = r'C:/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/contaminant_trim/'
trim2_dict = parse_fastq(trim2_dir)
Added fastq file Galaxy2-[adrenal_1_trimmed_paired
Added fastq file Galaxy2-[adrenal_1_trimmed_unpaired
Added fastq file Galaxy3-[adrenal_2_trimmed_paired
Added fastq file Galaxy3-[adrenal_2_trimmed_unpaired
Added fastq file Galaxy4-[brain_1_trimmed_paired
Added fastq file Galaxy4-[brain_1_trimmed_unpaired
Added fastq file Galaxy5-[brain_2_trimmed_paired
Added fastq file Galaxy5-[brain_2_trimmed_unpaired

Finished parsing directory.
In [44]:
%%time

plot_fastq(trim2_dict, per_row=4)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed eval> in <module>()

TypeError: plot_fastq() got an unexpected keyword argument 'per_row'
In [45]:
import matplotlib.pyplot as plt

allyourbase = pd.DataFrame()

for x in trim2_dict:
#     trim2_dict[x]['per_base_percents_df'].A.plot(kind='line')
    plt.plot(trim2_dict[x]['per_base_percents_df'].A, label=x[:20])
    allyourbase = allyourbase.append(trim2_dict[x]['per_base_percents_df'].A)
    
plt.title('"A" Content per base\nin trimmed Reads')
plt.xlabel('Base position in reads')
plt.ylabel('%A')
ticks = allyourbase.sum().T.nlargest(3).index.tolist()
ticks.extend(range(20, 50, 10))
plt.xticks(ticks)
plt.grid(True, axis='x')
plt.legend()
plt.show()

Ok, that didn't get rid of the biased first few bases either. As far as I know, there's no compelling biological reason that the reads should all have A content that peaks at bases 3, 5, and 9 in every transcript. This is almost certainly a sequencing artifact.

Let's try a more direct approach: can I just look at a dozen or so sequences and see the commonalities in those first few bases?

In [46]:
biopy_fastqs['Galaxy2-[adrenal_1']['df']['sequence'][:10]
Out[46]:
0    [A, T, C, T, T, T, T, G, T, G, G, C, T, A, C, ...
1    [T, C, C, A, T, A, C, A, T, A, G, G, C, C, T, ...
2    [G, T, A, T, A, A, C, G, C, T, A, G, A, C, A, ...
3    [A, A, C, G, G, A, T, C, C, A, T, T, G, T, T, ...
4    [G, C, T, A, A, T, C, C, G, A, C, T, T, C, T, ...
5    [T, G, G, A, C, A, G, T, T, G, C, T, C, C, T, ...
6    [A, T, T, A, G, G, A, A, A, C, A, T, G, G, A, ...
7    [C, A, A, T, A, G, C, C, A, G, A, T, G, G, T, ...
8    [G, T, G, C, C, A, A, A, T, T, G, T, C, A, C, ...
9    [C, C, C, G, G, C, C, T, A, A, C, T, T, T, C, ...
Name: sequence, dtype: object

Going Back to the Source

Yeah, I still don't see it. I don't know what's causing that base content skew. Maybe they're nothing to worry about, but it's a very noticable regularity.

Maybe I should check the raw data from the experiment, rather than this curated source? I doubt that it's so different, but it's worth noting that the Illumina BodyMap 2.0 project generated far, far more data than this limited dataset. I can't be certain that other aspects of the data weren't altered.

As noted at the top of this notebook, the full set of reads are available on NCBI's SRA. To interact with that repository, however, they require you to download their special software; the reads aren't available via any kind of structured URL query.

SRA Toolkit Setup

Binaries are available for Linux, Mac OS X, and Windows. Instructions are here. I put the unzipped output on the Mac at:

/Users/drew/Documents/Data/Bio/sratoolkit

On PC at:

C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\sratoolkit\sratoolkit.2.8.2-1-win64\bin

Once those are installed, use the prefetch tool to download the sequences with the specific accession number. Or, perhaps that's unnecessary.

Then, you have to use their fastq-dump program to get the FastQ (or FASTA) file from the raw image data that includes the flow cell the way the sequencer saw it.

# (on Mac):
cd /Users/drew/Documents/Data/Python/Galaxy_rnaseq
curl ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/ERP000546
/Users/drew/Documents/Data/Bio/sratoolkit/bin/fastq-dump -X 5 -Z ERP000546

Hmm... those both return errors. Using the 'fastq-dump' command gives:

fastq-dump.2.8.2 err: item not found while constructing within virtual database module - the path 'ERP000546' cannot be opened as database or table

Using curl gives:

curl: (78) RETR response: 550

Which apparently means (~99% of the time) 'file not found'. So I'm having trouble figuring out how to download SRA data.

On PC, the trial code:

fastq-dump.exe -X 5 -Z SRR390728

does work, outputting the first 5 reads from that dataset. I didn't have to run the config for the tool. I'll try

cd C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\SRA_raw
C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\sratoolkit\sratoolkit.2.8.2-1-win64\bin\fastq-dump.exe -X 10000 -A C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\accession.txt

Nope, that gives:

2017-11-10T03:01:31 fastq-dump.2.8.2 err: item not found while constructing within virtual database module - the path '/C/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/accession.txt' cannot be opened as database or table

Apparently just feeding a text file of accessions won't work with the '-A' flag.

Maybe this is a job for prefetch?

cd C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\SRA_raw
C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\sratoolkit\sratoolkit.2.8.2-1-win64\bin\prefetch.exe --list C:\Users\DMacKellar\Documents\Python\BioPython\Galaxy_rnaseq\accession.txt

Nope; that wants a properly formatted 'kart file', and it sounds like the only way to get that is to go through the website. It sounds like they expect you to just enter multiple calls to prefetch or fastq-dump with one accession each. I guess that makes sense when many modern fastq raw read files are >1GB in size.

Note: you'll want to use the 'split-3' arg with fastq-dump in order to separate the F and R reads into separate files.

I think I'll try to do this via the subprocess module:

In [72]:
if o_s == 'Windows':
    exe_ext = '.exe'
else:
    exe_ext = ''
paths['fastq_dump'] = os.path.join(paths['sra_tools_dir'], 
                                   'fastq-dump'+exe_ext)
paths['prefetch'] = os.path.join(paths['sra_tools_dir'], 
                                 'prefetch'+exe_ext)
In [7]:
import subprocess

os.chdir(paths['data_dir'])

commands1 = '{} -X 5 -Z SRR390728'.format(paths['fastq_dump'])
commands2 = '{} -X 10MB SRR390728'.format(paths['prefetch'])
commands3 = [paths['fastq_dump'], '-X', '5', '-Z', 'SRR390729']
commands4 = [paths['fastq_dump'], '-X', '5', '-Z', 'SRR390729']
subprocess.run(commands4, stdout=subprocess.PIPE, 
               stderr=subprocess.STDOUT, shell=False, check=False)
Out[7]:
CompletedProcess(args=['/Users/drew/Documents/Data/Bio/sratoolkit/bin/fastq-dump', '-X', '5', '-Z', 'SRR390729'], returncode=0, stdout=b'Read 5 spots for SRR390729\nWritten 5 spots for SRR390729\n@SRR390729.1 1 length=100\nACCCTTCCCAACACCCTGGGAACCTATGGGGCCAGGCGTTCCTTACCAAAGCTCATGAGAAATACATCGAGCACAAAACTCTTCATCAGCTGGTTTTACT\n+SRR390729.1 1 length=100\n::::::::::::::::::::::::::::::44444///////////////::::::::::::::::::::::::::1:::44443///////////////\n@SRR390729.2 2 length=100\nGGATATTCGTATTCATCTTAGTGGATAAATACCACCTTACTTGGAAAATACTTCATCTGTAAAATAGAGGACTGCATGGTTCTGTGTATTTTAGAGAAGG\n+SRR390729.2 2 length=100\n::::::::::::::::::::::::::::::44444/////////--////:::::::::::::0::::::::::::::::44444///////////////\n@SRR390729.3 3 length=100\nAAACGCCAACAGAATGGTATTCAGTATGAATGAAAGAACTAAATTTTAACTTTGGTTGACTCATCTTTATAAGATGACCAGGCTAGAGAATCAGAGATCA\n+SRR390729.3 3 length=100\n::::::::::::::::::::::::::::::44444//////////////-::::::::::::::::::::::::::::::44443///////////-///\n@SRR390729.4 4 length=100\nTAGCATTTCCATGATACAAACTAAATAACTCTATGGATTTTCTAATGGAAACACACACACACATACACACGCAGAAACAGAAGAGAAGTGGGTGAGATAT\n+SRR390729.4 4 length=100\n::::::::::::::::::::::::::::::44444/////////////-/::::::::::::::::::::::::::::::44434///////////////\n@SRR390729.5 5 length=100\nAATAATTTGCCAAATTTCTCTCCTAAGACCTAATAATGGATAACACAAACCAATAGTCCTTTTGAATAATCATGGCCTAGCTGTGGTTTTAAAAACATAT\n+SRR390729.5 5 length=100\n::::::::::::::::::::::::::::::44444///////////////::::::::::::::::::::::::::.:::43343///////////-///\n')
In [68]:
def sra_download(accession_list, *args, **kwargs):
    kwargs_new = []
    for k, v in kwargs.items():
        kwargs_new.append(k)
        kwargs_new.append(str(v))
    try:
        out_dir
    except NameError:
        pass
    else:
        if not os.path.isdir(out_dir):
            os.mkdir(out_dir)
        os.chdir(out_dir)
    print('Downloading {} files from the NCBI SRA server...'.format(
        len(accession_list)))
    for i, acc in enumerate(accession_list):
        cmds = [paths['fastq_dump'], *args, *kwargs_new, str(acc)]
        p = subprocess.Popen(cmds, stdout=subprocess.PIPE, 
                             stderr=subprocess.PIPE, shell=False)
        out, err = p.communicate()
        if len(err) > 0:
            print('\n{:>4} {} returned an error: \n{}\n'.format(
                i+1, acc, err.decode('ascii')))
        else:
            print('{:>4} {} downloaded'.format(i+1, acc))
In [15]:
os.chdir(os.path.join(paths['data_dir'], 'split3'))
In [9]:
args = ['--split-3']
kwargs = {'-X': '10000'}
acc1 = ['SRR390728']

sra_download(acc1, *args, **kwargs)
Downloading 1 files from the NCBI SRA server...
   1 SRR390728 downloaded

Ok, that's working. Now, for the purposes of trying out additional steps in the pipeline, and keeping everything lightweight-enough to keep in local memory, I'll just download the first 10,000 spots associated with each of the tissues in the BodyMap dataset.

In [16]:
%%time

# DCM Note: After running this once locally, comment out
# so we don't repeat the download whenever notebook is run

args = ['--split-3']
kwargs = {'-X': '10000'}
accession_list = sra_table['Run']

sra_download(accession_list, *args, **kwargs)
Downloading 48 files from the NCBI SRA server...
   1 ERR030858 downloaded
   2 ERR030859 downloaded
   3 ERR030860 downloaded
   4 ERR030861 downloaded
   5 ERR030862 downloaded
   6 ERR030863 downloaded
   7 ERR030864 downloaded
   8 ERR030865 downloaded
   9 ERR030866 downloaded
  10 ERR030867 downloaded
  11 ERR030868 downloaded
  12 ERR030869 downloaded
  13 ERR030870 downloaded
  14 ERR030871 downloaded
  15 ERR030872 downloaded

  16 ERR030873 returned an error: 
2018-06-16T05:20:31 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:20:31 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4644011486 bytes
2018-06-16T05:20:31 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:20:32 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:20:32 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4644011486 bytes
2018-06-16T05:20:32 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  17 ERR030874 returned an error: 
2018-06-16T05:20:41 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:20:41 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4549781976 bytes
2018-06-16T05:20:41 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:20:42 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:20:42 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4549781976 bytes
2018-06-16T05:20:42 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  18 ERR030875 returned an error: 
2018-06-16T05:20:50 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:20:50 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4591110380 bytes
2018-06-16T05:20:50 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:20:52 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:20:52 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4591110380 bytes
2018-06-16T05:20:52 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  19 ERR030876 returned an error: 
2018-06-16T05:21:00 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:00 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4503544683 bytes
2018-06-16T05:21:00 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:21:01 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:01 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4503544683 bytes
2018-06-16T05:21:01 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  20 ERR030877 returned an error: 
2018-06-16T05:21:10 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:10 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4274557907 bytes
2018-06-16T05:21:10 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:21:11 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:11 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4274557907 bytes
2018-06-16T05:21:11 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  21 ERR030878 returned an error: 
2018-06-16T05:21:19 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:19 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4034661681 bytes
2018-06-16T05:21:19 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:21:21 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:21 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4034661681 bytes
2018-06-16T05:21:21 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  22 ERR030879 returned an error: 
2018-06-16T05:21:30 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:30 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4183525678 bytes
2018-06-16T05:21:30 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:21:32 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:32 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4183525678 bytes
2018-06-16T05:21:32 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  23 ERR030880 returned an error: 
2018-06-16T05:21:40 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:40 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4086763666 bytes
2018-06-16T05:21:40 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:21:41 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:41 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4086763666 bytes
2018-06-16T05:21:41 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  24 ERR030881 returned an error: 
2018-06-16T05:21:51 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:51 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 3663840382 bytes
2018-06-16T05:21:51 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:21:53 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:21:53 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 3663840382 bytes
2018-06-16T05:21:53 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  25 ERR030882 returned an error: 
2018-06-16T05:22:17 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:17 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4211553402 bytes
2018-06-16T05:22:17 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:22:19 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:19 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4211553402 bytes
2018-06-16T05:22:19 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  26 ERR030883 returned an error: 
2018-06-16T05:22:27 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:27 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 3747076106 bytes
2018-06-16T05:22:27 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:22:29 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:29 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 3747076106 bytes
2018-06-16T05:22:29 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  27 ERR030884 returned an error: 
2018-06-16T05:22:37 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:37 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4465828573 bytes
2018-06-16T05:22:37 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:22:38 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:38 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4465828573 bytes
2018-06-16T05:22:38 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  28 ERR030885 returned an error: 
2018-06-16T05:22:46 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:46 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4329482848 bytes
2018-06-16T05:22:46 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:22:48 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:48 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4329482848 bytes
2018-06-16T05:22:48 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  29 ERR030886 returned an error: 
2018-06-16T05:22:56 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:56 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4251850780 bytes
2018-06-16T05:22:56 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:22:57 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:22:57 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4251850780 bytes
2018-06-16T05:22:57 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  30 ERR030887 returned an error: 
2018-06-16T05:23:07 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:07 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4219250662 bytes
2018-06-16T05:23:07 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:23:08 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:08 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4219250662 bytes
2018-06-16T05:23:08 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  31 ERR030888 returned an error: 
2018-06-16T05:23:17 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:17 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4269678434 bytes
2018-06-16T05:23:17 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:23:18 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:18 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4269678434 bytes
2018-06-16T05:23:18 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  32 ERR030889 returned an error: 
2018-06-16T05:23:26 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:26 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4214590189 bytes
2018-06-16T05:23:26 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:23:28 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:28 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4214590189 bytes
2018-06-16T05:23:28 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  33 ERR030857 returned an error: 
2018-06-16T05:23:36 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:36 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4393870053 bytes
2018-06-16T05:23:36 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:23:38 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:38 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4393870053 bytes
2018-06-16T05:23:38 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  34 ERR030856 returned an error: 
2018-06-16T05:23:47 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:47 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4297829576 bytes
2018-06-16T05:23:47 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:23:48 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:23:48 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4297829576 bytes
2018-06-16T05:23:48 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely


  35 ERR030890 downloaded

  36 ERR030891 returned an error: 
2018-06-16T05:26:47 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:26:47 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4304747593 bytes
2018-06-16T05:26:47 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:26:48 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:26:48 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4304747593 bytes
2018-06-16T05:26:48 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  37 ERR030892 returned an error: 
2018-06-16T05:26:58 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:26:58 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4514127484 bytes
2018-06-16T05:26:58 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:26:59 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:26:59 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4514127484 bytes
2018-06-16T05:26:59 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  38 ERR030893 returned an error: 
2018-06-16T05:27:08 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:08 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4451723433 bytes
2018-06-16T05:27:08 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:27:10 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:10 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4451723433 bytes
2018-06-16T05:27:10 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  39 ERR030894 returned an error: 
2018-06-16T05:27:28 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:28 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4384862345 bytes
2018-06-16T05:27:28 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:27:30 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:30 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4384862345 bytes
2018-06-16T05:27:30 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  40 ERR030895 returned an error: 
2018-06-16T05:27:38 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:38 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4306556654 bytes
2018-06-16T05:27:38 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:27:40 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:40 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4306556654 bytes
2018-06-16T05:27:40 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  41 ERR030896 returned an error: 
2018-06-16T05:27:48 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:48 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4389038264 bytes
2018-06-16T05:27:48 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:27:49 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:49 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4389038264 bytes
2018-06-16T05:27:49 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  42 ERR030897 returned an error: 
2018-06-16T05:27:58 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:58 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4474485461 bytes
2018-06-16T05:27:58 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:27:59 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:27:59 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4474485461 bytes
2018-06-16T05:27:59 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  43 ERR030898 returned an error: 
2018-06-16T05:28:15 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:15 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4554494573 bytes
2018-06-16T05:28:15 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:28:16 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:16 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4554494573 bytes
2018-06-16T05:28:16 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  44 ERR030899 returned an error: 
2018-06-16T05:28:25 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:25 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4522881547 bytes
2018-06-16T05:28:25 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:28:26 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:26 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4522881547 bytes
2018-06-16T05:28:26 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  45 ERR030900 returned an error: 
2018-06-16T05:28:34 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:34 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4564540209 bytes
2018-06-16T05:28:34 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:28:36 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:36 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4564540209 bytes
2018-06-16T05:28:36 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  46 ERR030901 returned an error: 
2018-06-16T05:28:45 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:45 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4379516773 bytes
2018-06-16T05:28:45 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:28:46 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:46 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4379516773 bytes
2018-06-16T05:28:46 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  47 ERR030902 returned an error: 
2018-06-16T05:28:55 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:55 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4539800668 bytes
2018-06-16T05:28:55 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:28:56 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:28:56 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4539800668 bytes
2018-06-16T05:28:56 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely



  48 ERR030903 returned an error: 
2018-06-16T05:29:04 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:29:04 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4346571367 bytes
2018-06-16T05:29:04 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely
2018-06-16T05:29:06 fastq-dump.2.8.2 err: unknown while updating file within file system module - unknown system error 'No space left on device(28)'
2018-06-16T05:29:06 fastq-dump.2.8.2 int: unknown while updating file within file system module - cannot size local file to 4346571367 bytes
2018-06-16T05:29:06 fastq-dump.2.8.2 int: no error - skipping the cache-tee completely


CPU times: user 163 ms, sys: 410 ms, total: 574 ms
Wall time: 9min 19s

Importing Reads

Now that I've downloaded a sample of the data sets, let's split them into the relevant experimental condition: the single-end 75bp reads, the paired 50bp reads, and the single-end 100bp reads (this last, remember, comes from pooling the RNA from the various tissues, and is probably uninteresting for this analysis).

Then, I'll import the tissue-specific paired- and single-end reads with Biopython's SeqIO module.

In [3]:
from Bio import SeqIO

def bmap_reads_import(path=None):
    exps, reads = {}, {}
    bad = ['16 Tissues mixture', np.NaN]
    sra_tiss_df = sra_table[~sra_table.loc[:, 'organism_part'].isin(bad)]
    sra_tiss_df_s = sra_tiss_df.sort_values(by=['organism_part', 'LibraryLayout'])
    conv_dict = {'_s': '', '_f': '_1', '_r': '_2'}
    for _, row in sra_tiss_df_s.iterrows():
        if row['LibraryLayout'] == 'SINGLE':
            key = row['organism_part']+'_s'
            exps[key] = row['Run']
        else:
            exps[row['organism_part']+'_f'] = row['Run']
            exps[row['organism_part']+'_r'] = row['Run']
    for exp, run in exps.items():
        if exp not in reads.keys():
            f_name = run+conv_dict[exp[-2:]]+'.fastq'
            f_path = os.path.join(path, f_name)
            if os.path.isfile(f_path):
                with open(f_path, 'r') as f:
                    reads[exp] = SeqIO.to_dict(SeqIO.parse(f, 'fastq'))
            
    return reads
    
In [4]:
%%time

paths['split3'] = os.path.join(paths['data_dir'], 'split3')
bmap_reads = bmap_reads_import(path=paths['split3'])
Wall time: 15.7 s
In [5]:
# Let's just check one of the records
first_rec = bmap_reads['adipose_f']['ERR030880.1']
first_rec
Out[5]:
SeqRecord(seq=Seq('CTGCTTGCAACTANAGCAACAGCCTTCATAGGCTATGNNCTCCCGAGAGG', SingleLetterAlphabet()), id='ERR030880.1', name='ERR030880.1', description='ERR030880.1 HWI-BRUNOP16X_0001:1:1:3016:1084#0 length=50', dbxrefs=[])

Visualizing Read Statistics

Ok, now it's time to check the reads' lengths, aggregate quality scores, and various other attributes. Biopython stores the individual records as instances of their SeqRecord class, and the attribute carrying the PHRED quality scores for each base in the sequence are in SeqRecord.letter_annotations, which is a dict, in this case containing the sole key of phred_quality:

In [6]:
print(first_rec.description)
print(first_rec.letter_annotations['phred_quality'])
ERR030880.1 HWI-BRUNOP16X_0001:1:1:3016:1084#0 length=50
[20, 21, 20, 20, 20, 20, 21, 20, 20, 20, 32, 31, 34, 2, 34, 19, 20, 15, 20, 20, 35, 31, 37, 35, 34, 35, 38, 38, 35, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

Let's try plotting each tissue type's quality score per base position. I think it would be best to break the quality scores out from their default organization as being values within a nested dict, to being a numpy array where the first axis are individual reads and the second axis is positions within each read. Then I'll output these to a separate dict where each key is the experiment, and the value is the associated array.

In [7]:
def recover_seq_score(reads_dict):
    seq_dict, score_dict = {}, {}
    for exp, d in reads_dict.items():
        n_records = len(d)
        first_rec = d[list(d.keys())[0]]
        length = int(first_rec.description.split()[-1][7:])
        seq_array = np.empty((n_records, length), dtype=np.unicode_)
        score_array = np.empty((n_records, length), dtype=int)
        for i, record in zip(range(n_records), d.values()):
            seq_array[i] = np.asarray(record.seq, dtype=np.unicode_)
            score_array[i] = np.asarray(
                record.letter_annotations['phred_quality'])
        seq_dict[exp] = seq_array
        score_dict[exp] = score_array
    return seq_dict, score_dict
In [8]:
%%time

bmap_seq, bmap_phreds = recover_seq_score(bmap_reads)
Wall time: 1min 6s
In [9]:
first_seq = bmap_seq['adipose_f']
first_score = bmap_phreds['adipose_f']

print('adipose_f sequence array:\n{}\n\nadipose_f score array:\n{}'.format(
    first_seq, first_score))
adipose_f sequence array:
[['C' 'T' 'G' ... 'A' 'G' 'G']
 ['C' 'T' 'C' ... 'G' 'T' 'C']
 ['G' 'C' 'C' ... 'T' 'A' 'G']
 ...
 ['A' 'A' 'T' ... 'T' 'A' 'T']
 ['C' 'A' 'G' ... 'G' 'A' 'G']
 ['C' 'G' 'C' ... 'C' 'G' 'T']]

adipose_f score array:
[[20 21 20 ...  2  2  2]
 [20 20 19 ... 34 27 34]
 [20 20 20 ...  2  2  2]
 ...
 [39 39 39 ... 39 39 39]
 [39 39 39 ...  2  2  2]
 [39 39 39 ... 39 39 39]]

Now I'll need a function to plot the phred scores by position, with each tissue type in a subplot. I'll look at a few other quality measures, too, including sequence duplication level, per-base distribution of different nucleotide values, etc.

In [10]:
def plot_read_stats(seq_dict, score_dict):
    batches = int(np.ceil(len(seq_dict) / 3))
    plt.close('all')
    plt.style.use('ggplot')
    my_dpi = 96
    fig_scale = 500
    # put at most 3 experiments per row
    fig, axs = plt.subplots(nrows=5*batches, ncols=3, sharex=False,
                            figsize=(3*fig_scale/my_dpi, 
                                     3*batches*fig_scale/my_dpi), 
                            dpi=my_dpi)
    first = seq_dict[list(seq_dict.keys())[0]]
    
    labels = ['Per Base\nQuality', 'Per Base\nSequence\nContent', 'Per Base\nN Content', 
             'Per Sequence\nMean Quality\nScore', '# of Seq\nDuplication\nLevel']
    ylims = [(0, 40), (0, 0.6), (0, 0.01), (0, first.shape[0]/3), (0, first.shape[0])]
    for ax in np.ndenumerate(axs):
        ind = ax[0][0]%5
        ax[-1].set_ylim(ylims[ind])
        ax[-1].set_ylabel(labels[ind])

    for i, (exp, d1), d2 in zip(range(len(seq_dict)), seq_dict.items(), score_dict.values()):
        batch = 5*int(np.floor(i / 3))
        i2 = i % 3
        axs[batch, i%3].set_title(exp, fontsize=fig_scale/25)
        n_s = (d1 == 'N').sum(axis=1)
        base_counts = {}
        for x in ['A', 'G', 'C', 'T', 'N']:
            base_counts[x] = (d1 == x).sum(axis=0) / np.count_nonzero(d1, axis=0)
        per_seq_means = d2.mean(axis=1)
        seq_means_counts, _ = np.histogram(per_seq_means, bins=range(0, 41))
        _, cnt = np.unique(d1, axis=0, return_counts=True)
        n_duplicates, _ = np.histogram(cnt, bins=range(11))
        
        axs[batch, i2].boxplot(d2, showfliers=False, 
                               boxprops={'color': 'salmon'}, 
                               whiskerprops={'color':'indianred'})
        axs[batch, i2].set_xlim(0, 75)
        axs[batch, i2].xaxis.set_ticks(np.arange(1, 75, 5))
        axs[batch, i2].xaxis.set_ticklabels(np.arange(1, 75, 1)[::5])
        for k, v in base_counts.items():
            axs[batch+1, i2].plot(range(d1.shape[1]), v, label=k)
        axs[batch+1, i2].legend(loc=1)
        axs[batch+2, i2].plot(range(d1.shape[1]), base_counts['N'])
        axs[batch+3, i2].plot(range(40), seq_means_counts)
        axs[batch+4, i2].plot(range(-1, 9), n_duplicates)
        axs[batch+4, i2].set_xlim(0, 10)
    plt.tight_layout()
    return axs
In [11]:
%%time

plot_read_stats(bmap_seq, bmap_phreds)
plt.show()
Wall time: 1min 45s

That's... surprisingly bad. And Illumina generated these data? Oh well.

Many, many of these reads are not useable. I think I'll want to discard any reads whose mean quality score is below ~20, and the first dozen or so bases from every read, given how skewed the per-base distribution is for the start of the reads, which all seem to share a very stereotypical pattern of spikes that suggests there are untrimmed adapters or something.

Trimming the Reads

In the past, I've mostly used Trimmomatic, but some other sources online have suggested that BBDuk is faster. The latter is part of the BBMap suite of tools; I downloaded it on the PC to the same dir as the other tools.

According to the BodyMap project's protocol info, the RNA was ligated with illumina small RNA v1.5 adaptors. That info should be used when trimming the reads.

In [12]:
illum_seq = {'Small RNA PCR Primer 1': 'CAAGCAGAAGACGGCATACGA',
             'Small RNA PCR Primer 2': 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA',
             'Small RNA Sequencing Primer': 'CGACAGGTTCAGAGTTCTACAGTCCGACGATC',
             'v1.5 Small RNA 3\' Adapter': 'ATCTCGTATGCCGTCTTCTGCTTG'}

paths['illum_seq_fasta'] = os.path.join(paths['data_dir'], 'illum_seq.fasta')
with open(paths['illum_seq_fasta'], 'w') as f:
    for name, seq in illum_seq.items():
        f.write('>{}\n'.format(name))
        f.write('{}\n'.format(seq))
        
In [13]:
with open(paths['illum_seq_fasta'], 'r') as f:
    lines = f.readlines()
    adapters = []
    for line in lines: 
        adapters.append(line.rstrip('\n'))
    
print(adapters)
['>Small RNA PCR Primer 1', 'CAAGCAGAAGACGGCATACGA', '>Small RNA PCR Primer 2', 'AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA', '>Small RNA Sequencing Primer', 'CGACAGGTTCAGAGTTCTACAGTCCGACGATC', ">v1.5 Small RNA 3' Adapter", 'ATCTCGTATGCCGTCTTCTGCTTG']

Alternatively, this FAQ post says that BBDuk can be used to remove all known Illumina adapter sequences, with the arg:

ref=truseq.fa.gz,truseq_rna.fa.gz,nextera.fa.gz
In [14]:
if o_s == 'Darwin':
    paths['bbmap_dir'] = '/Users/drew/Documents/Data/Python/bbmap'
if o_s == 'Windows':
    paths['bbmap_dir'] = r'C:\Users\DMacKellar\Documents\\
Python\BioPython\BBMap'

paths['bbduk'] = os.path.join(paths['bbmap_dir'], 'bbduk.sh')
In [15]:
# Already in 'paths', but re-acquire to keep separate
untrimmed_reads = {}
for file in os.listdir(paths['split3']):
    file_under = str(file).replace('.', '_')
    path = os.path.join(paths['split3'], file)
    untrimmed_reads[file_under] = path
        
# Save a copy, since dir might change
paths['untrimmed_reads_list'] = os.path.join(paths['data_dir'], 
                                             'untrimmed_reads_list.txt')
with open(paths['untrimmed_reads_list'], 'w') as f:
    for k, v in untrimmed_reads.items():
        f.write('{},{}\n'.format(k, v))
        
untrimmed_reads_dict = {}
t1, t2, t3 = [], [], []
with open(paths['untrimmed_reads_list'], 'r') as f:
    for line in f.readlines():
        new_line = line.rstrip('\n')
        for x in new_line.split(sep=','):
            t1.append(x)
t2 = t1[0::2]
t3 = t1[1::2]
for x, y in zip(t2, t3):
    untrimmed_reads_dict[x] = y
        
# untrimmed_reads_dict

Now, I'll try a few iterations of the BBDuk trimming program with various arguments set to trim a pared-down dataset, and visualize the output iteratively to look for improvement along the neccessary dimensions. Once I have a workable approach, I'll extend it to the rest of the reads, and visualize the entire output.

The thryoid reads look fairly representative; I'll start with them.

In [16]:
thyroid_reads = {}
for k, v in bmap_reads.items():
    if k[:7] == 'thyroid':
        thyroid_reads[k] = v
print(thyroid_reads.keys())
dict_keys(['thyroid_f', 'thyroid_r', 'thyroid_s'])

For reference, here's what they look like before modification:

In [17]:
thyroid_seq, thyroid_score = recover_seq_score(thyroid_reads)
plot_read_stats(thyroid_seq, thyroid_score)
plt.show()

Using subprocess, try a run of these through bbduk:

In [18]:
thyr = sra_table[sra_table.loc[:, 'organism_part'] == 'thyroid'].loc[:, 'Run'].tolist()
thyroid_files = []
for k, v in untrimmed_reads_dict.items():
    if k[:9] in thyr:
        thyroid_files.append(v)
        
thyroid_files
Out[18]:
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\ERR030872_1.fastq',
 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\ERR030872_2.fastq',
 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\ERR030903.fastq']

The docs for BBDuk say that it should be used with either paired reads or singlet reads per run, not both.

In [19]:
thyroid_paired = thyroid_files[:-1]
thyroid_single = thyroid_files[-1]

First, let's try adapter trimming.

Note: Ok, this worked very straightforward-ly on the Mac, but took forever to get working on the PC. On hte PC, it kept executing without complaint, but writing no output FastQ files. Basically, as the authors note under the "Standard Syntax" heading in the [BBMap usage guide](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/usage-guide/), Windows doesn't have native recognition of & automatic support for Java when executed from the command line. Instead of just specifying the absolute path to the BBDuk shell script, you have to invoke something like:

java -cp /path/to/BBMap/current jgi.BBDukF

And then specify the remainder of the arguments. In other words, you call Java, specify a 'classpath' [notes here](https://javarevisited.blogspot.com/2011/01/how-classpath-work-in-java.html) with the absolute path to the 'current' dir within the BBMap distro, then call specific scripts by specifying `subdir.script` sort of format. Obviously, I'm not very used to running Java from the command line.

In [20]:
import datetime
import subprocess

# Note: pass in fastq files as nested lists; pairs together
def run_bbduk(fastq_list, *args, **kwargs):
    try:
        trim_dirs
    except NameError:
        trim_dirs = {}
        new_dir = '1'
    else:
        new_dir = max([int(x) for x in trim_dirs.keys()]) + 1
    if o_s == 'Windows':
        bbduk = 'java -cp {}\current jgi.BBDukF'.format(paths['bbmap_dir'])
    elif o_s == 'Darwin':
        bbduk = paths['bbduk']
    else:
        print('Error setting operating system')
    kwargs_new = []
    for k, v in kwargs.items():
        kwargs_new.append('{}={}'.format(k, v))
    try:
        out_dir
    except NameError:
        ts = datetime.datetime.now().replace(microsecond=0).timestamp()
        read_ts = datetime.datetime.fromtimestamp(ts).isoformat().replace(':', '_')
        out_dir = os.path.join(os.getcwd(), 'trim_{}'.format(read_ts))
        os.mkdir(out_dir)
    else:
        if not os.path.isdir(out_dir):
            os.mkdir(out_dir)
    trim_dirs[str(new_dir)] = str(out_dir)
    os.chdir(out_dir)
    conditions = [*args, *kwargs_new]
    with open('trim_log.txt', 'w') as f:
        for condition in conditions:
            f.write('{}\n'.format(condition))
    print('Starting trimming run with args:\n{}\n{}\n...'.format(
        args, kwargs_new))
    for i, fq in enumerate(fastq_list):
        cmds = [bbduk, *args, *kwargs_new]
        fq_base = [os.path.splitext(os.path.basename(x))[0] for x in fq]
        out_files = ['{}_t.fastq'.format(os.path.join(out_dir, x)) for x in fq_base]
        if len(fq) > 1:
            cmds.append('in1={0} in2={1} out1={2} out2={3}'.format(fq[0], fq[1], out_files[0], out_files[1]))
        elif len(fq) == 1:
            cmds.append('in1={0} out1={1}'.format(fq[0], out_files[0]))
        else:
            print('I can\'t understand the input fastq_list; \
did you pass a nested list?')
        final_cmds = ' '.join(cmd for cmd in cmds)
#         final_cmds = final_cmds.split(sep=' ')
#         print(final_cmds)
        p = subprocess.Popen(final_cmds, stdout=subprocess.PIPE, 
                             stderr=subprocess.PIPE, shell=True)
        out, err = p.communicate()
        if len(err) > 0:
            print('\n{:>4} {} returned an error: \n{}\n'.format(
                i+1, fq, err.decode('ascii')))
        else:
            for f in fq:
                print('{:>4} {} trimmed'.format(
                        i+1, os.path.basename(f)))
    return trim_dirs
In [21]:
paths['bbduk'] = os.path.join(paths['bbmap_dir'], 'bbduk.sh')
paths['split3'] = os.path.join(paths['data_dir'], 'split3')

os.chdir(paths['split3'])
kwargs = {'ref': paths['illum_seq_fasta'],
          'ktrim': 'l', 'k': 31, 'mink': 11,
          'hdist': 1}
args = ['tpe tbo -Xmx4g']
# args = ['ref={} ktrim=l k=31 mink=11 hdist=1 tpe tbo -Xmx4g'.format(paths['illum_seq_fasta'])]

trim_dirs = run_bbduk([thyroid_paired], *args, **kwargs)
Starting trimming run with args:
('tpe tbo -Xmx4g',)
['ref=C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\illum_seq.fasta', 'ktrim=l', 'k=31', 'mink=11', 'hdist=1']
...

   1 ['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\ERR030872_1.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\ERR030872_2.fastq'] returned an error: 
Executing jgi.BBDukF [tpe, tbo, -Xmx4g, ref=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\illum_seq.fasta, ktrim=l, k=31, mink=11, hdist=1, in1=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\ERR030872_1.fastq, in2=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\ERR030872_2.fastq, out1=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\trim_2018-07-08T18_47_48\ERR030872_1_t.fastq, out2=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\trim_2018-07-08T18_47_48\ERR030872_2_t.fastq]
Version 38.07

maskMiddle was disabled because useShortKmers=true
0.142 seconds.
Initial:
Memory: max=1836m, total=124m, free=108m, used=16m

Added 6422 kmers; time: 	0.123 seconds.
Memory: max=1836m, total=124m, free=101m, used=23m

Input is being processed as paired
Started output streams:	0.043 seconds.
Processing time:   		0.475 seconds.

Input:                  	20000 reads 		1000000 bases.
KTrimmed:               	1 reads (0.01%) 	11 bases (0.00%)
Trimmed by overlap:     	4 reads (0.02%) 	22 bases (0.00%)
Total Removed:          	0 reads (0.00%) 	33 bases (0.00%)
Result:                 	20000 reads (100.00%) 	999967 bases (100.00%)

Time:                         	0.653 seconds.
Reads Processed:       20000 	30.64k reads/sec
Bases Processed:       1000k 	1.53m bases/sec


In [22]:
def fastqs_import(path=None):
    fastq_files = {}
    for file in os.listdir():
        filename, file_extension = os.path.splitext(file)
        if file_extension in ['.fastq', '.fq']:
            with open(file, 'r') as f:
                fastq_files[filename] = SeqIO.to_dict(SeqIO.parse(f, 'fastq'))
    return fastq_files
In [23]:
trimmed_reads = fastqs_import(os.getcwd())
trimmed_reads.keys()
Out[23]:
dict_keys(['ERR030872_1_t', 'ERR030872_2_t'])
In [24]:
# thyroid_seq_t, thyroid_score_t = recover_seq_score(trimmed_reads)
# plot_read_stats(thyroid_seq_t, thyroid_score_t)
# plt.show()

Hmm... That throws an error. Obviously, my custom-written code to return the quality metrics of reads hits hiccups when confronted with reads of variable length. I think it's time to go back to a purpose-built program to achieve this. I've used FastQC in the past, but was looking for some means of displaying any output within a Jupyter notebook, so I can keep the whole pipeline together in this one place. While searching for such an option, I came upon another program called MultiQC. An alternative is fadapa, essentially a wrapper API for Python to parse the output of FastQC.

In [25]:
def run_fastqc(trim_dir, *args, **kwargs):
    os.chdir(trim_dir)
    kwargs_new = []
    for k, v in kwargs.items():
        kwargs_new.append('{}={}'.format(k, v))
    files = []
    for file in os.listdir(trim_dir):
        path = os.path.abspath(file)
        _, extension = os.path.splitext(path)
        if extension in ['.fq', '.fastq']:
            files.append(path)
    if o_s == 'Darwin':
        fastqc = paths['fastqc']
        multiqc = 'multiqc .'
    if o_s == 'Windows':
        fastqc = 'java -cp {};{} \
uk.ac.babraham.FastQC.FastQCApplication'.format(paths['bzip_dir'], paths['fastqc_dir'])
        multiqc = 'python {} .'.format(os.path.join(*[paths['multiqc_dir'], 'scripts', 'multiqc']))
    cmds = [fastqc, *args, *kwargs_new, *files]
    final_cmds = ' '.join(cmds)
#     print(final_cmds)
    p = subprocess.Popen(final_cmds, stdout=subprocess.PIPE, 
                         stderr=subprocess.PIPE, shell=True)
    out, err = p.communicate()
    if p.returncode != 0:
        print('\n returned an error: \n{}\n'.format(
            err.decode('ascii')))
    else:
        print('FastQC run')
    p = subprocess.Popen(multiqc, stdout=subprocess.PIPE, 
                         stderr=subprocess.PIPE, shell=True)
    out, err = p.communicate()
    if p.returncode != 0:
        print('\n returned an error: \n{}\n'.format(
            err.decode('ascii')))
    else:
        print('MultiQC run')
        report = os.path.join(trim_dir, 'multiqc_report.html')
        webbrowser.open(report)
    
In [26]:
run_fastqc(trim_dirs['1'])
FastQC run
MultiQC run

Troubleshooting FastQC

Ok, on 20180621, I tried running the above run_fastqc on the PC, and of course ran into the problem mentioned above, that on Windows it has to be called with the syntax of java -cp C:\somewhere\ uk.ac.FastQC. After doing that, though, it kept returning an error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/itadaki/bzip2/BZip2InputStream
        at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
        at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
        at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:152)
        at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
        at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Caused by: java.lang.ClassNotFoundException: org.itadaki.bzip2.BZip2InputStream
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher\$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 5 more

I fought it for a while, and figured that the core problem was the top line, the org/itadaki..., was some missing dependency of FastQC. The lines below are just a traceback of more recent files that tried to call this module. I asked around at PuPPy Python Programming night, and Pat said that she concurred this was probably the problem. Often times, the relative path provided is a real-life url, i.e., I could try navigating in a browser to itadaki.org. That turned out not to be the case here, though; there was no such website.

Before I asked her, however, I had thought that maybe this was some sort of core dependency that might be part of the standard JDK or JRE distro, and I just had to append my local classpath to the line calling the FastQC classpath (joined by semicolons). This didn't end up being the case; the code is instead more obscure, and had to be hunted down, but the point is that I tried ensuring that the classpath to the core Java modules on my PC were invoked. I couldn't easily find that path on my PC (searching 'Java') into the File Explorer returns way too many results, so I googled for more info, and it said that running the command for %i in (java.exe) do @echo. %~$PATH:i in the terminal, and it did in fact return my local JDK instance. But then, trying to add this to the classpath caused more headaches, because it's in the Program Files dir, and when I try entering that as a path, Python double-slashes the space to escape it, so subprocess ends up complaining about not being able to find 'Files'... That may be specific to subprocess, because when I enter the absolute path without escaping the space, the os.listdir() command does find it.

I asked Pat about this, since she's running on PC, too. She told me about something called Windows SysInternals suite, and downloading that yields a bunch of useful tools, one of which is called junction. Changing to the dir it unzips, then running junction C:\new_name "C:\old name" will create a stable soft linkage (stable in the sense that it will be present after rebooting the OS). So you just run junction C:\Programs "C:\Program Files", and now we can get the JDK dir without any spaces.

Anyways, as I mentioned, Pat also helped me diagnose the initial problem, with the missing org.itadaki code. Copy-pasting that entire path into Google brought up a GitHub account that appeared to have the proper code. I downloaded it and put the expanded repository in the Documents\Data\Python\BioPython dir on the PC. Then I was able to direct the run_fastqc function to the appropriate subdir holding the org dir. Pat says you can then delete all the rest of that repository. Furthermore, it can be risky just downloading and running code from GitHub without having a good idea of what it's supposed to do, as it may contain malicious code.

I was alarmed by this, and pointed out that I hadn't considered GitHub to possess anything other than safe, friendly code before, and questioned how prevalent a problem the other attendees think such an issue might be. They didn't know, but point out that it's a credible risk. There is plenty of discussion of this issue online. But Pat said it was probably ok in this case because the owner of the repository was Yahoo, which is generally considered safe.

In addition to questions of security, however, there are questions of propriety, such as the legitimacy of Yahoo hosting this code, when they didn't appear to be the original owner. A more above-board way to acquire the code would be to not just take the top Google hit, but try to find the definitive version; Pat said one way to hunt this down was to enter into the GitHub search package:itadaki. Another is to follow this advice, and more generally, here is a list of other advanced search keywords for use with GitHub. Of course, there's nothing strictly enforcing such coordination and compliance as would leave a paper trail easy to follow; any author could serruptitiously (or ignorantly, as might happen in my case) copy the code without forking it, then post or deploy it elsewhere. In this case, the Network tab doesn't appear to be available on Yahoo's bzip2 github page. And searching package:itadaki on Github brings up >1K code results, the vast majority of which are in Java. None appear to have the username itadaki; I noted that they're probably mostly academic users who use FastQC or some other like it, found issues with missing dependencies, and decided to post their own local copies. It could be that the original author of the bzip2 package even predates GitHub.

Anyways, copying the code without observing the proper license can get you into legal trouble; Pat said the two main licenses granted in the open-access world are the GNU Public and MIT licenses. She indicated that the former was more restrictive, and it is, but not as much as she seemed to indicate. It is possible to provide open-source software but not permit others the right to host/redistribute it without their permission. Rather, it appears that the most stringent aspect of GPL is its status as copyleft, meaning code derived from it can't use different license terms than the original, and it sounds like this was mostly meant to prevent other authors from trying to restrict or monetize code that was originally meant to be free. The MIT license, on the other hand, has virtually no restrictions, mostly because the author doesn't care to police the use of their creation. And the Yahoo version of the gzip2 package appears to use the MIT license.

All of this is to say that I'm learning more about both Java and Open Source software. In the latter case, the whole idea of forking code from Github, as opposed to merely downloading it, is that you may have the intention of modifying it; in which case, GitHub retains historical info about where the code came from, and who is working on it. This makes it easier to recombine your efforts afterwards, which is generally done via issuing a pull request, which the original author(s) can then decide whether to commit to the main branch (i.e., the most common/authoritative version of the package), or keep separate (maybe as a useful offshoot, designed to do something slightly different from the original package). All of these practices are common in maintaining and developing open-source code between multiple authors.

The bzip2 code had to be compiled before it could be called by run_fastqc, and this is done by navigating to the root dir of the repository and running javac .. Then, specifying the proper java -cp {};{} call in the run_fastqc function has it executing successfully on the PC.


Troubleshooting MultiQC on Windows

But: multiqc still isn't. I got carried away with the Java stuff; it isn't the way to get multiqc to work from the command line on the PC. Rather, if I open C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc in NotePad++, it's clear that's written in Python. But even running python C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc (with or without the terminating space and period, to indicate it's meant to search the current dir) doesn't work immediately; it returns:

Traceback (most recent call last):
  File "C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc", line 89, in <module>
    callback = util_functions.view_all_tags,
AttributeError: module 'multiqc.utils.util_functions' has no attribute 'view_all_tags'

When I open the script in NotePad++ to troubleshoot it, it's clear that the whole thing, while it is in Python, is composed in a complicated way that I don't immediately recognize. Specifically, the multiqc script isn't arranged as a class, but a single function (with nested functions inside it), preceded by a whole lot of what appear to be decorators ('@click.option'), apparently to handle cli-type options that are specified at runtime, to make it behave like a standard bash script. The offending line cited above (#89) is within one of these blocks, apparently meant to handle an input of '--view-tags', which I would note I'm not even invoking here. Anyways, the other file that it's calling, util_functions, does so have a function called 'view_all_tags', but for some reason Python isn't handling it appropriately.

I tried making a copy of the multiqc script in the same dir called multiqc.py (i.e., I didn't change anything other than the extension; calling python multiqc.py now returns:

 Traceback (most recent call last):
 File "C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc.py", line 32, in <module>
 from multiqc import __version__
 File "C:\Users\DMacKellar\Documents\Python\BioPython\MultiQC\scripts\multiqc.py", line 32, in <module>
 from multiqc import __version__
 ImportError: cannot import name '__version__'

The subdir multiqc does indeed lack a file named __version__, but provision for supplying such a variable is offered within the file in that subdir, __init__.py.

I thought about going back to using Fadapa, or making my own parser function for FastQC, but it's too big a pain to be worth abandoning multiqc at this point. Eventually, I got so annoyed that I wrote an issue to the GitHub repo; the author Phil appears to be active recently in addressing them. I'll see what happens.

Contacting the Author of MultiQC

Ok, on 20180623, I saw that Phil Ewell had gotten back to me overnight, and said the most likely problem was mismatched python versions used to install vs run multiqc, which isn't the problem here. He said to try Miniconda, by which I think he meant to use the conda package manager to download and install multiqc, but the bioconda repo for multiqc lacks a Windows version. He also pointed me towards a test script which ran successfully on some sort of Windows/Linux emulator site called Appveyor for testing packages.

I tried stepping through the individual install steps contained in that script (I haven't figured out how to adapt it to run as a Powershell script on Windows), but the archive it downloads appears to lack the multiqc script itself. I reported all of this on Github, and we'll see if he comes back with any other ideas.

As an alternative, I might be able to modify his test script to run on Windows with Powershell. Some example sites giving tutorials for powershell are available.

Phil responded to my response, but without much additional detail. He says that bioconda has no support for Windows (apparently, for any of its packages), in which case I responded with confusion as to why he would suggest that I use Miniconda to install multiqc.

Further, he says that the multiqc script isn't in the downloaded test_data zip file, but that the original repo is also downloaded by the test script. I don't see any line that would accomplish that, however; rather, I figured out that, since the appveyor.yml file is present in the main repo's code itself, he actually meant that he downloads the main repo to the Appveyor instance he's running, then invokes the test script from within it. That's why the test script contains (line #30) a call to python setup.py install without any context as to where that file would've come from; it's being run in the root dir of the entire main MultiQC repo. Anyways, all of this means that the code that was run successfully in his script is not different from that which I've tried and which fails to run.

I eventually went back to the line #89 that python complains about within the multiqc script, however, and looked more closely at the context. I noticed that many of the other tags above and below the @click.option( '--view-tags'... have some of the same args passed to them, but not the 'callback' flag being set in the offending line. In fact, searching for that in the rest of the script returns no hits; this is the only instance of a 'callback' flag being set in the main MultiQC script. So I figured trying commenting it out might help, and it does; the script will now run if called directly in Windows by passing 'python \path_to\multiqc_dir\scripts\multiqc \path_to\data_dir\reads_with_fastqc_reports -o \path_to\data_dir\subdir_to_write_to'. I guessed that something was going wrong with the way he was using the decorator to make an attribute of the view_all_tags function within util_functions file. I passed along this link to a blog post in my response that might be helpful, apparently there are several pitfalls to using decorators in Python when anything changes. In any case, I'm now free to use MultiQC to summarize the FastQC reports on my PC, and hopefully provide a friendlier format to display them within this notebook.

In [27]:
# from IPython.display import HTML

# multiqc_report = r'C:/Users/DMacKellar/Documents/Data/Bio/Bmap/split3/trim_2018-06-25T18_18_00/multiqc_report.html'
# HTML(multiqc_report)

Ok, clearly I'm gonna want to find a more compact or flexible way to represent the output of MultiQC than just to paste the entire report into this notebook. Presumably, this is configurable, but for now I'm more interested in improving the trimming; this output is still poor.

There's one other issue: opening the MultiQC report in the same tab as this notebook ends up screwing up the formatting; Ewell made the MultiQC report HTML pretty fancy, and it contains a lot of subtle touches that override some of the core context that Jupyter expects. To keep this notebook viable, therefore, I'll either have to find a way to make the report's HTML formatting flatter or, more simply, open it in another tab.

I'll go with the latter for now. This post shows some simple code that appears to work for me. This response shows another, more complicated way of doing it, that might be more flexible in the future. I'll just try adding the former code to the run_fastqc function above, so that it automatically open the new report in a separate tab.


Resuming Trimming

Now, to re-run bbduk with different trimming options. The first file actually has pretty good quality per base scores out to about 30bp; but the second file (ERR030872_2_t) has mean scores below 20 up until base 11. Further, the mean quality-per-read scores are unacceptably low for a large number of reads in both files. This might not be too surprising, since the instructions I used for the first run were taken from a section entitled 'adapter trimming', and it didn't do much. I think I'll need to perform different iterations in series, starting with getting a better trimming run.

It appears that it's not recognizing the adapters in the reference file. The bbduk docs say that they include a more comprehensive list of Illumina adapters in their distro under /bbmap/resources/adapters.fa. I'll point to that and try again.

After trying a few different configurations, it looks like a good mix for adapter trimming is to reduce the value of the integers passed to the k and mink parameters. The bbduk docs had said that the value of k should 'be at most the length of the adapters', which I took to mean larger was better, but it's the kmer used to search the adapter sequences against your seq, so specifying a very long match might be too stringent. Values of k and mink around 12 and 3, respectively, seem to reduce most of the overrepresentation of ACTG in the first dozen bases of both reads.

Using the adapters file provided with bbmap causes more reads to be removed than with the custom file I made (with k=12 & mink=3, the % of bases removed is 21% as opposed to 7.5%, respectively. Alternatively, increasing the hamming distance from 1 to 2 improves 5' variability, trimming ~16% of base info.

In [28]:
paths['bbmap_adapt'] = os.path.join(*[paths['bbmap_dir'], 'resources', 'adapters.fa'])

os.chdir(paths['split3'])
kwargs1 = {'ref': paths['bbmap_adapt'],
#           'ref': paths['illum_seq_fasta'],
          'ktrim': 'l', 'k': 15, 'mink': 3,
          'hdist': 1}
kwargs2 = {'ref': paths['illum_seq_fasta'],
          'ktrim': 'l', 'k': 12, 'mink': 3,
          'hdist': 2}
args = ['tbo -Xmx4g']

# run_bbduk([thyroid_paired], *args, **kwargs1)
trim_dirs = run_bbduk([thyroid_paired], *args, **kwargs2)

run_fastqc(trim_dirs['1'])
Starting trimming run with args:
('tbo -Xmx4g',)
['ref=C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\illum_seq.fasta', 'ktrim=l', 'k=12', 'mink=3', 'hdist=2']
...

   1 ['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\ERR030872_1.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\ERR030872_2.fastq'] returned an error: 
Executing jgi.BBDukF [tbo, -Xmx4g, ref=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\illum_seq.fasta, ktrim=l, k=12, mink=3, hdist=2, in1=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\ERR030872_1.fastq, in2=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\ERR030872_2.fastq, out1=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\trim_2018-07-08T18_48_05\ERR030872_1_t.fastq, out2=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\trim_2018-07-08T18_48_05\ERR030872_2_t.fastq]
Version 38.07

maskMiddle was disabled because useShortKmers=true
0.073 seconds.
Initial:
Memory: max=1836m, total=124m, free=108m, used=16m

Added 46120 kmers; time: 	0.285 seconds.
Memory: max=1836m, total=124m, free=102m, used=22m

Input is being processed as paired
Started output streams:	0.028 seconds.
Processing time:   		0.345 seconds.

Input:                  	20000 reads 		1000000 bases.
KTrimmed:               	19951 reads (99.76%) 	166412 bases (16.64%)
Trimmed by overlap:     	4 reads (0.02%) 	9 bases (0.00%)
Total Removed:          	778 reads (3.89%) 	166421 bases (16.64%)
Result:                 	19222 reads (96.11%) 	833579 bases (83.36%)

Time:                         	0.669 seconds.
Reads Processed:       20000 	29.91k reads/sec
Bases Processed:       1000k 	1.50m bases/sec


FastQC run
MultiQC run

Running a full pipeline

Next (after adapter trimming), let's try quality trimming. Actually, after reviewing the bbduk docs (also elaborated upon here), I see that the author(s) actually recommend a specific, explicit workflow:

  • Adapter Trimming
  • Quality Trimming
  • Force Trimming
  • Force-Trim Modulo
  • Quality Filtering
  • Kmer Filtering

And additional steps that I don't quite understand, but sound like they're of diminishing importance. I'll try running the 7 steps above in succession using the default args and kwargs listed in the docs, and check the result.


Ok, after having run the entire pipeline with the default params, I see that it's too aggressive. It reduces the input 10,000 reads down to about 5,500, most of which (~3,000) are 40bp for the forward file, and 35bp for the reverse file, but each file also has significant size spikes at every multiple of 5bp in length (30, 25, etc.). Per sequence quality scores and mean scores per base are both high; the data that remain would perform very well, I think. But I'd like to reduce the amount of loss, I think some of the reads/bases being discarded could have value.

According to the printed output below, it seems that the biggest losses are occurring at the steps of Quality trimming (ditches 32% of reads and 45% of the bases) and Force trimming (ditches 17% of reads and 31% of bases). Losses at other steps are marginal. Actually, the Quality trimming loss was partially because I hadn't used the default quality cutoff of 10 from the right side, but 15 from either side. Also, with adapter trimming at the default values it's not really catching anything, which could be part of the problem with quality at later steps. I'll return that step to the params above.

Actually, it looks like the first few bases are still too problematic, not in terms of quality scores necessarily, but in terms of variability in distribution of ATCG values. It's a problem for the first 10 in the first file, but only the first 5 in the second. I think I'll return the force trim left to 10, and add a length filtering step of 20 at the penultimate step.

This approach keeps only ~4800 reads per file, with lengths between 20 and 40bp (mostly on the higher side), but the output quality scores are excellent. I may provide another variant of this workflow that's less aggressive, but this does look kind of like the best way to recover useable data from the thyroid reads. I'll wrap this workflow into a function and scale up with more reads.

In [29]:
# Note: pass in fastq files as a list of lists; pairs together
def run_bbduk_pipeline(fastq_list, trim_dirs=None, use_defaults=True, 
                       args_dict=None, kwargs_dict=None, 
                       keep_intermediate_files=False):
    def return_default_dicts():
        default_kwargs = {'adapter_trimming': {'ref':paths['illum_seq_fasta'], 
                                           'ktrim':'l', 'k':'15', 
                                           'mink':'4', 'hdist':'1'},
                          'quality_trimming': {'qtrim':'rl', 'trimq':'10'},
                          'force_trimming': {'ftl':'10', 'ftr':'139'},
                          'force-trim_modulo':{'ftm':'5'},
                          'quality_filtering':{'maq':'10', 'minlen':'20'},
                          'kmer_filtering':{'ref':paths['phi_x'], 'k':'31', 
                                            'hdist':'1'}
                         }
        default_args = {}
        for k in default_kwargs.keys():
            default_args[k] = []
        return default_args, default_kwargs
    
    if trim_dirs==None:
        trim_dirs = {}
        new_dir = '1'
    else:
        new_dir = max([int(x) for x in trim_dirs.keys()]) + 1
    if o_s == 'Windows':
        bbduk = 'java -cp {}\current jgi.BBDukF'.format(paths['bbmap_dir'])
    elif o_s == 'Darwin':
        bbduk = paths['bbduk']
    else:
        print('Error setting operating system')
    try:
        out_dir
    except NameError:
        ts = datetime.datetime.now().replace(microsecond=0).timestamp()
        read_ts = datetime.datetime.fromtimestamp(ts).isoformat().replace(':', '_')
        out_dir = os.path.join(os.getcwd(), 'trim_{}'.format(read_ts))
        os.mkdir(out_dir)
    else:
        if not os.path.isdir(out_dir):
            os.mkdir(out_dir)
    trim_dirs[str(new_dir)] = str(out_dir)
    os.chdir(out_dir)
    
    if use_defaults:
        args, kwargs = return_default_dicts()
    if args_dict:
        for k, v in args_dict.items():
            for k2, v2 in v.items():
                args[k][k2] = v2
    if kwargs_dict:
        for k, v in kwargs_dict.items():
            for k2, v2 in v.items():
                kwargs[k][k2] = v2
    
    for k, v in kwargs.items():
        for k2, v2 in v.items():
            args[k].append('{}={}'.format(k2, v2))
    
    with open('trim_log.txt', 'w') as f:
        for k, v in args.items():
            f.write('{}: {}\n'.format(k, ' '.join(v)))
    print('Starting trimming run with args:\n')
    for k, v in args.items():
        print('{}: {}'.format(k, ' '.join(v)))
    print('\n\nRead files processed:')
    def get_next_step(arg_list):
        order_of_steps = {'format_conversion': 0, 'adapter_trimming': 1, 
                          'contaminant_filtering': 2, 'nextera_lmp_lib_split': 3,
                          'human_contam_removal': 4, 'quality_trimming': 5, 
                          'force_trimming': 6, 'force-trim_modulo': 7,
                          'quality_filtering': 8, 'kmer_filtering': 9,
                          'kmer_masking': 10, 'quality_recalib': 11, 
                          'deduplication': 12, 'normalization': 13}
        sorted_steps = sorted(arg_list, key=lambda x: order_of_steps[x])
        for step in sorted_steps:
            yield step
        
    for i, fq in enumerate(fastq_list):
        steps_run = []
        steps = get_next_step(args)
        if len(fq) > 1:
            fq_base = [os.path.splitext(os.path.basename(x))[0] for x in fq]
        else:
            fq_base = [os.path.splitext(os.path.basename(*fq))[0]]
        filepaths = [os.path.join(out_dir, x) for x in fq_base]
        for step in steps:
            if len(steps_run) == 0:
                in_suff = ''
                out_suff = '_{}'.format(step)
                for f1, f2 in zip(fq, fq_base):
                    shutil.copy2(f1, '{}.fastq'.format(os.path.join(out_dir, f2)))
            elif len(steps_run) == len(args)-1:
                in_suff = '_{}'.format(steps_run[-1])
                out_suff = '_t'
            else:
                in_suff = '_{}'.format(steps_run[-1])
                out_suff = '_{}'.format(step)
            infiles = ['{}{}'.format(x, in_suff) for x in filepaths]
            outfiles = ['{}{}'.format(x, out_suff) for x in filepaths]
            if len(fq) == 2:
                files = 'in1={0}.fastq in2={1}.fastq out1={2}.fastq out2={3}.fastq'.format(infiles[0], infiles[1], outfiles[0], outfiles[1])
            elif len(fq) == 1:
                files = 'in={0}.fastq out1={1}.fastq'.format(infiles[0], outfiles[0])
            else:
                print('I can\'t understand the input fastq_list; did you pass a nested list?')
            cmds = [bbduk, files, *args[step]]
            final_cmds = ' '.join(cmd for cmd in cmds)
    
            p = subprocess.Popen(final_cmds, stdout=subprocess.PIPE, 
                                 stderr=subprocess.PIPE, shell=True)
            out, err = p.communicate()
            if p.returncode == 0:
                steps_run.append(step)
            else:
                print('{} file: {} step returned error {}'.format(fq[0], step, err.decode('ascii')))
                break
        print(fq_base)
        if keep_intermediate_files == False:
            for f in filepaths:
                os.remove('{}.fastq'.format(f))
                for step in steps_run[:-1]:
                    os.remove('{}_{}.fastq'.format(f, step))
    return trim_dirs
In [30]:
os.chdir(paths['split3'])
trim_dirs = run_bbduk_pipeline([thyroid_paired], trim_dirs=trim_dirs)
Starting trimming run with args:

adapter_trimming: ref=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\illum_seq.fasta ktrim=l k=15 mink=4 hdist=1
quality_trimming: qtrim=rl trimq=10
force_trimming: ftl=10 ftr=139
force-trim_modulo: ftm=5
quality_filtering: maq=10 minlen=20
kmer_filtering: ref=C:/Users/DMacKellar/Documents/Python/BioPython/BBMap/resources/phix174_ill.ref.fa k=31 hdist=1


Read files processed:
['ERR030872_1', 'ERR030872_2']
In [31]:
latest_dir = trim_dirs[str(max(list(int(x) for x in trim_dirs.keys())))]
run_fastqc(latest_dir)
FastQC run
MultiQC run

Ok, that seems like an extremely complicated function to run the bbduk steps in a modifiable way, but I've not got a pipeline that should work. Let's apply it to all of the BodyMap read files.

In [32]:
def get_fastqs(input_dir):
    singlets = []
    pairs = []
    for file in os.listdir(input_dir):
        base = os.path.splitext(file)[0]
        ext = os.path.splitext(file)[-1]
        if ext in ['.fq', '.fastq']:
            trim_index = 0
            if base[-2:] == '_t':
                trim_index += -2
            if base[-2+trim_index:0+trim_index] in ['_1', '_2']:
                pairs.append(file)
            else:
                print(trim_index)
                print(base[-2+trim_index:0+trim_index])
                print(base[-2:])
                filepath = os.path.join(input_dir, file)
                singlets.append([filepath])
    pairs2 = sorted(pairs)
    paired = []
    print(pairs)
    for x, y in zip(pairs2[0::2], pairs2[1::2]):
        base_x = os.path.splitext(x)[0][:-2+trim_index]
        base_y = os.path.splitext(y)[0][:-2+trim_index]
        print(base_x, base_y)
        if base_x == base_y:
            filepath_x = os.path.join(input_dir, x)
            filepath_y = os.path.join(input_dir, y)
            paired.append([filepath_x, filepath_y])

    return singlets + paired
In [33]:
def get_fastqs(input_dir):
    singlets = []
    pairs = []
    for file in os.listdir(input_dir):
        base = os.path.splitext(file)[0].rstrip('_t')
        ext = os.path.splitext(file)[1]
        if ext in ['.fq', '.fastq']:
            if base[-2:] in ['_1', '_2']:
                pairs.append(file)
            else:
                filepath = os.path.join(input_dir, file)
                singlets.append([filepath])
    pairs2 = sorted(pairs)
    paired = []
    for x, y in zip(pairs2[0::2], pairs2[1::2]):
        base_x = os.path.splitext(x)[0].rstrip('_t')[:-2]
        base_y = os.path.splitext(y)[0].rstrip('_t')[:-2]
        if base_x == base_y:
            filepath_x = os.path.join(input_dir, x)
            filepath_y = os.path.join(input_dir, y)
            paired.append([filepath_x, filepath_y])

    return singlets + paired
In [34]:
os.chdir(paths['split3'])
bmap_inputs = get_fastqs(paths['split3'])
trim_dirs = run_bbduk_pipeline(bmap_inputs, trim_dirs=trim_dirs)
Starting trimming run with args:

adapter_trimming: ref=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\illum_seq.fasta ktrim=l k=15 mink=4 hdist=1
quality_trimming: qtrim=rl trimq=10
force_trimming: ftl=10 ftr=139
force-trim_modulo: ftm=5
quality_filtering: maq=10 minlen=20
kmer_filtering: ref=C:/Users/DMacKellar/Documents/Python/BioPython/BBMap/resources/phix174_ill.ref.fa k=31 hdist=1


Read files processed:
['ERR030856']
['ERR030857']
['ERR030858']
['ERR030859']
['ERR030860']
['ERR030861']
['ERR030862']
['ERR030863']
['ERR030864']
['ERR030865']
['ERR030866']
['ERR030867']
['ERR030868']
['ERR030869']
['ERR030870']
['ERR030871']
['ERR030888']
['ERR030889']
['ERR030890']
['ERR030891']
['ERR030892']
['ERR030893']
['ERR030894']
['ERR030895']
['ERR030896']
['ERR030897']
['ERR030898']
['ERR030899']
['ERR030900']
['ERR030901']
['ERR030902']
['ERR030903']
['ERR030872_1', 'ERR030872_2']
['ERR030873_1', 'ERR030873_2']
['ERR030874_1', 'ERR030874_2']
['ERR030875_1', 'ERR030875_2']
['ERR030876_1', 'ERR030876_2']
['ERR030877_1', 'ERR030877_2']
['ERR030878_1', 'ERR030878_2']
['ERR030879_1', 'ERR030879_2']
['ERR030880_1', 'ERR030880_2']
['ERR030881_1', 'ERR030881_2']
['ERR030882_1', 'ERR030882_2']
['ERR030883_1', 'ERR030883_2']
['ERR030884_1', 'ERR030884_2']
['ERR030885_1', 'ERR030885_2']
['ERR030886_1', 'ERR030886_2']
['ERR030887_1', 'ERR030887_2']
In [35]:
latest_dir = trim_dirs[str(max(list(int(x) for x in trim_dirs.keys())))]
run_fastqc(latest_dir)
FastQC run
MultiQC run

Ok, that looks good. After lots of wrangling, I'd say I'm ready to align.

Aligning Reads

The classic approach is to use TopHat and Cufflinks. It sounds like some more recent choices are STAR and BBMap. Since I've already used BBDuk, I'll try the latter first.

One major question is whether to align against the human genome or some previously-compiled transcriptome. It sounds like the former approach will catch more of the reads and have lower duplication rates (i.e., places where a read may map multiple times). It looks like, as of 20180629, the latest stable NCBI RefSeq build for the human genome is GRCh38.p12. More generally, the definitive resource for maintaining and iteratively improving human genome assembly and annotation data is the Genome Reference Consortium; the 'GRC' part means 'Genome Reference Consortium'. That consortium currently only handles maintaining the standard genomes for 4 organisms, so the first letter after GRC indicates the organism: humans ('h'), zebrafish ('z'), mouse ('m'), and chicken ('g' for 'Gallus'). '38' means that (as of Nov. 2017) this is the 38th such release, and the '.p12' subscript means that this release has been 'patched' a dozen times, representing more minor revisions to the sequence since that version was released.

Each such release contains multiple datasets; transcriptomes and genomes, both including sequence from the Human Genome Sequencing project that have successfully been assembled into super-assemblies that represent whole chromosomes, as well as a genome that includes additional large scaffolds that haven't been successfully assigned to a particular chromosome. The summary data table shows that there are two main kinds of assembly within: the more succinct and the more complete.

The records are hosted in the Genbank database, and therefore outside the reach of the sra_download function that I defined earlier. Instead, you could use NCBI's Batch Entrez website, but the programmatic solution is probably to use BioPython's EFetch. Note that this generally requires an API key affiliated with an NCBI account. Furthermore, while I thought that it would be best to use the full-featured genbank files for the reference genome (to retain annotation info), the bbmap docs state that they only accept fasta or fastq as inputs.

Finally, when it comes to setting up a BBMap run, the seqanswers thread has some useful advice.

In [36]:
paths['grch38'] = os.path.join(paths['data_dir'], 'GRCh38_p12_table.txt')
with open(paths['grch38'], 'r') as f:
    grch38 = pd.read_csv(f, header=62, sep='\t', encoding='utf-8')
print(grch38.shape)
grch38
(595, 10)
Out[36]:
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
0 1 assembled-molecule 1 Chromosome CM000663.2 = NC_000001.11 Primary Assembly 248956422 chr1
1 2 assembled-molecule 2 Chromosome CM000664.2 = NC_000002.12 Primary Assembly 242193529 chr2
2 3 assembled-molecule 3 Chromosome CM000665.2 = NC_000003.12 Primary Assembly 198295559 chr3
3 4 assembled-molecule 4 Chromosome CM000666.2 = NC_000004.12 Primary Assembly 190214555 chr4
4 5 assembled-molecule 5 Chromosome CM000667.2 = NC_000005.10 Primary Assembly 181538259 chr5
5 6 assembled-molecule 6 Chromosome CM000668.2 = NC_000006.12 Primary Assembly 170805979 chr6
6 7 assembled-molecule 7 Chromosome CM000669.2 = NC_000007.14 Primary Assembly 159345973 chr7
7 8 assembled-molecule 8 Chromosome CM000670.2 = NC_000008.11 Primary Assembly 145138636 chr8
8 9 assembled-molecule 9 Chromosome CM000671.2 = NC_000009.12 Primary Assembly 138394717 chr9
9 10 assembled-molecule 10 Chromosome CM000672.2 = NC_000010.11 Primary Assembly 133797422 chr10
10 11 assembled-molecule 11 Chromosome CM000673.2 = NC_000011.10 Primary Assembly 135086622 chr11
11 12 assembled-molecule 12 Chromosome CM000674.2 = NC_000012.12 Primary Assembly 133275309 chr12
12 13 assembled-molecule 13 Chromosome CM000675.2 = NC_000013.11 Primary Assembly 114364328 chr13
13 14 assembled-molecule 14 Chromosome CM000676.2 = NC_000014.9 Primary Assembly 107043718 chr14
14 15 assembled-molecule 15 Chromosome CM000677.2 = NC_000015.10 Primary Assembly 101991189 chr15
15 16 assembled-molecule 16 Chromosome CM000678.2 = NC_000016.10 Primary Assembly 90338345 chr16
16 17 assembled-molecule 17 Chromosome CM000679.2 = NC_000017.11 Primary Assembly 83257441 chr17
17 18 assembled-molecule 18 Chromosome CM000680.2 = NC_000018.10 Primary Assembly 80373285 chr18
18 19 assembled-molecule 19 Chromosome CM000681.2 = NC_000019.10 Primary Assembly 58617616 chr19
19 20 assembled-molecule 20 Chromosome CM000682.2 = NC_000020.11 Primary Assembly 64444167 chr20
20 21 assembled-molecule 21 Chromosome CM000683.2 = NC_000021.9 Primary Assembly 46709983 chr21
21 22 assembled-molecule 22 Chromosome CM000684.2 = NC_000022.11 Primary Assembly 50818468 chr22
22 X assembled-molecule X Chromosome CM000685.2 = NC_000023.11 Primary Assembly 156040895 chrX
23 Y assembled-molecule Y Chromosome CM000686.2 = NC_000024.10 Primary Assembly 57227415 chrY
24 HSCHR1_CTG1_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270706.1 = NT_187361.1 Primary Assembly 175055 chr1_KI270706v1_random
25 HSCHR1_CTG2_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270707.1 = NT_187362.1 Primary Assembly 32032 chr1_KI270707v1_random
26 HSCHR1_CTG3_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270708.1 = NT_187363.1 Primary Assembly 127682 chr1_KI270708v1_random
27 HSCHR1_CTG4_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270709.1 = NT_187364.1 Primary Assembly 66860 chr1_KI270709v1_random
28 HSCHR1_CTG5_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270710.1 = NT_187365.1 Primary Assembly 40176 chr1_KI270710v1_random
29 HSCHR1_CTG6_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270711.1 = NT_187366.1 Primary Assembly 42210 chr1_KI270711v1_random
... ... ... ... ... ... ... ... ... ... ...
565 HSCHR19LRC_PGF2_CTG3_1 alt-scaffold 19 Chromosome GL949753.2 = NW_003571061.2 ALT_REF_LOCI_8 796479 chr19_GL949753v2_alt
566 HSCHR19_4_CTG3_1 alt-scaffold 19 Chromosome KI270938.1 = NT_187693.1 ALT_REF_LOCI_9 1066800 chr19_KI270938v1_alt
567 HSCHR19KIR_FH15_B_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270882.1 = NT_187636.1 ALT_REF_LOCI_10 248807 chr19_KI270882v1_alt
568 HSCHR19KIR_G085_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270883.1 = NT_187637.1 ALT_REF_LOCI_11 170399 chr19_KI270883v1_alt
569 HSCHR19KIR_G085_BA1_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270884.1 = NT_187638.1 ALT_REF_LOCI_12 157053 chr19_KI270884v1_alt
570 HSCHR19KIR_G248_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270885.1 = NT_187639.1 ALT_REF_LOCI_13 171027 chr19_KI270885v1_alt
571 HSCHR19KIR_G248_BA2_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270886.1 = NT_187640.1 ALT_REF_LOCI_14 204239 chr19_KI270886v1_alt
572 HSCHR19KIR_GRC212_AB_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270887.1 = NT_187641.1 ALT_REF_LOCI_15 209512 chr19_KI270887v1_alt
573 HSCHR19KIR_GRC212_BA1_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270888.1 = NT_187642.1 ALT_REF_LOCI_16 155532 chr19_KI270888v1_alt
574 HSCHR19KIR_LUCE_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270889.1 = NT_187643.1 ALT_REF_LOCI_17 170698 chr19_KI270889v1_alt
575 HSCHR19KIR_LUCE_BDEL_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270890.1 = NT_187644.1 ALT_REF_LOCI_18 184499 chr19_KI270890v1_alt
576 HSCHR19KIR_RSH_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270891.1 = NT_187645.1 ALT_REF_LOCI_19 170680 chr19_KI270891v1_alt
577 HSCHR19KIR_RSH_BA2_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270914.1 = NT_187668.1 ALT_REF_LOCI_20 205194 chr19_KI270914v1_alt
578 HSCHR19KIR_T7526_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270915.1 = NT_187669.1 ALT_REF_LOCI_21 170665 chr19_KI270915v1_alt
579 HSCHR19KIR_T7526_BDEL_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270916.1 = NT_187670.1 ALT_REF_LOCI_22 184516 chr19_KI270916v1_alt
580 HSCHR19KIR_ABC08_A1_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270917.1 = NT_187671.1 ALT_REF_LOCI_23 190932 chr19_KI270917v1_alt
581 HSCHR19KIR_ABC08_AB_HAP_C_P_CTG3_1 alt-scaffold 19 Chromosome KI270918.1 = NT_187672.1 ALT_REF_LOCI_24 123111 chr19_KI270918v1_alt
582 HSCHR19KIR_ABC08_AB_HAP_T_P_CTG3_1 alt-scaffold 19 Chromosome KI270919.1 = NT_187673.1 ALT_REF_LOCI_25 170701 chr19_KI270919v1_alt
583 HSCHR19KIR_FH05_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270920.1 = NT_187674.1 ALT_REF_LOCI_26 198005 chr19_KI270920v1_alt
584 HSCHR19KIR_FH05_B_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270921.1 = NT_187675.1 ALT_REF_LOCI_27 282224 chr19_KI270921v1_alt
585 HSCHR19KIR_FH06_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270922.1 = NT_187676.1 ALT_REF_LOCI_28 187935 chr19_KI270922v1_alt
586 HSCHR19KIR_FH06_BA1_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270923.1 = NT_187677.1 ALT_REF_LOCI_29 189352 chr19_KI270923v1_alt
587 HSCHR19KIR_FH08_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270929.1 = NT_187683.1 ALT_REF_LOCI_30 186203 chr19_KI270929v1_alt
588 HSCHR19KIR_FH08_BAX_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270930.1 = NT_187684.1 ALT_REF_LOCI_31 200773 chr19_KI270930v1_alt
589 HSCHR19KIR_FH13_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270931.1 = NT_187685.1 ALT_REF_LOCI_32 170148 chr19_KI270931v1_alt
590 HSCHR19KIR_FH13_BA2_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270932.1 = NT_187686.1 ALT_REF_LOCI_33 215732 chr19_KI270932v1_alt
591 HSCHR19KIR_FH15_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270933.1 = NT_187687.1 ALT_REF_LOCI_34 170537 chr19_KI270933v1_alt
592 HSCHR19KIR_RP5_B_HAP_CTG3_1 alt-scaffold 19 Chromosome GL000209.2 = NT_113949.2 ALT_REF_LOCI_35 177381 chr19_GL000209v2_alt
593 MT assembled-molecule MT Mitochondrion J01415.2 = NC_012920.1 non-nuclear 16569 chrM
594 HSCHRUN_RANDOM_CTG29 unplaced-scaffold na na KI270752.1 <> na Primary Assembly 27745 chrUn_KI270752v1

595 rows × 10 columns

In [3]:
import json

paths['credentials_json'] = os.path.join(paths['data_dir'], 'credentials.json')
with open(paths['credentials_json'], 'r') as f:
    credentials = json.load(f)
In [38]:
human_chr = grch38['RefSeq-Accn'].loc[:23].tolist()
human_chr.append(grch38['RefSeq-Accn'].loc[593])
human_all = grch38['RefSeq-Accn'].tolist()
# print(sorted(grch38.columns))
# human_chr
In [39]:
from Bio import Entrez

paths['grch38_dir'] = os.path.join(paths['data_dir'], 'grch38')
os.chdir(paths['grch38_dir'])
Entrez.email = credentials['ncbi']['email']
ncbi_api = credentials['ncbi']['api_key']

# sra_download(human_all)
def download_genbank(acc_list, email=Entrez.email, api_key=ncbi_api, 
                     db='nuccore', rettype='fasta', 
                     retmode='text', file_ext='.fasta'):
    for acc in acc_list:
        if not os.path.isfile(acc):
            # Downloading...
            net_handle = Entrez.efetch(db=db, id=acc, rettype=rettype, retmode=retmode)
            out_handle = open('{}{}'.format(acc, file_ext), "w")
            out_handle.write(net_handle.read())
            out_handle.close()
            net_handle.close()
            print("{} downloaded".format(acc))
In [40]:
# # Note: cell commented out after running once;
# # don't want to download all those data again

# download_genbank(human_chr)

Now, I'll need to concatenate these into one large Fasta file, since I believe bbmap requires a single file as reference for mapping, not a dir.

In [41]:
# %%time

# # Note: Comment out this cell after running once

# paths['grch38_dir'] = os.path.join(paths['data_dir'], 'grch38')
# filenames = []

# for f in os.listdir(paths['grch38_dir']):
#     filenames.append(os.path.join(paths['grch38_dir'], f))
# paths['concat'] = os.path.join(paths['grch38_dir'], 'hs.fasta')
# with open(paths['concat'], 'w') as outfile:
#     for fname in filenames:
#         with open(fname) as infile:
#             for line in infile:
#                 outfile.write(line)
#                 outfile.write('\n\n')

Note that the resulting file is very large: around 3.5GB. By comparison to a calculation of the total data content of the human genome, this is very large. Anyways, let's try aligning!

In [42]:
bbmap_path = os.path.join(paths['bbmap_dir'], 'bbmap.sh')
bbmap_path1 = 'java -cp {}\current align2.BBMap'.format(paths['bbmap_dir'])
out_path = os.path.join(*[paths['split3'], 'bbmappings', 'manual'])
in_path = os.path.join(trim_dirs['3'], 'ERR030856_t.fastq')
ref_path = os.path.join(paths['grch38_dir'], 'hs.fasta')
ref_path1 = os.path.join(paths['grch38_dir'], 'NC_000024.10.fasta')
cmd = '{0} -Xmx4G in={1} out={2}.sam ref={3} maxindel=200k ambig=random intronlen=20 xstag=us'.format(bbmap_path1, in_path, out_path, ref_path1)
print(cmd)
cmd1 = '{0} -Xmx7G ref={1} usemodulo'.format(bbmap_path1, ref_path)
print(cmd1)
java -cp C:\Users\DMacKellar\Documents\Python\BioPython\BBMap\current align2.BBMap -Xmx4G in=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\trim_2018-07-08T18_48_43\ERR030856_t.fastq out=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\bbmappings\manual.sam ref=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\grch38\NC_000024.10.fasta maxindel=200k ambig=random intronlen=20 xstag=us
java -cp C:\Users\DMacKellar\Documents\Python\BioPython\BBMap\current align2.BBMap -Xmx7G ref=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\grch38\hs.fasta usemodulo
In [43]:
sra_table.head()
sra_table.loc[sra_table['Run'] == 'ERR030856', 'Sex']
sra_table[sra_table['Run'] == 'ERR030856']
# sra_table.columns
Out[43]:
AvgSpotLen BioSample BioSourceProvider Experiment InsertSize LibraryLayout Library_Name MBases MBytes Run ... DATASTORE_filetype DATASTORE_provider Instrument LibrarySelection LibrarySource LoadDate Organism Platform ReleaseDate SRA_Study
0 100 SAMEA962337 NaN ERX011226 0 SINGLE HCT20170 7290 4098 ERR030856 ... sra ncbi Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC 2014-05-30 Homo sapiens ILLUMINA 2011-03-17 ERP000546

1 rows × 30 columns

Ok, what I'm learning is that my laptop lacks the RAM to build the index file for the whole human genome. Instead, I'll have to run a scaled-down version of the mapping, using single chromosomes. For instance, the index built fine with just NC_000024.10 (the Y chromosome). If desireable, it might be fine to loop through the different chromosome reference files for each read, but given this is mostly a proof-of-concept notebook, I think I'll be ok with just running the reads against a single chromosome.

Actually, the mapping is relatively fast, compared to the indexing operation, at least when dealing with a few thousand reads, so in such a case the more time-permissive operation would be to loop through the chromosomes first, and for each chromosome, align all of the reads to them.

I was stunned at first to run an example alignment of ERR030856 against a reference chromosome, and get the following stats out:

   ------------------   Results   ------------------

Genome:                 1
Key Length:             13
Max Indel:              200000
Minimum Score Ratio:    0.56
Mapping Mode:           normal
Reads Used:             6508    (486300 bases)

Mapping:                4.974 seconds.
Reads/sec:              1308.52
kBases/sec:             97.78


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                   4.1026%             267         3.2387%              15750
unambiguous:              1.7210%             112         1.7674%               8595
ambiguous:                2.3817%             155         1.4713%               7155
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:        0.2151%              14         0.1141%                555
semiperfect site:         0.2151%              14         0.1141%                555

Match Rate:                   NA               NA        82.6490%              14009
Error Rate:              20.2238%             253        17.3451%               2940
Sub Rate:                19.9840%             250         9.9941%               1694
Del Rate:                 1.5987%              20         7.0796%               1200
Ins Rate:                 1.5987%              20         0.2714%                 46
N Rate:                   0.0799%               1         0.0059%                  1
Splice Rate:              0.3197%               4       (splices at least 20 bp)

Total time:             10.177 seconds.

4% of reads mapping successfully? But then I realized that, of course you wouldn't expect that many reads to align to a single one of the 23 possible chromosomes. Checking the sra_table, I find that those reads were one of the '16-tissue mixture's.

That brings up a good test of accuracy: check the ovaries and testes tissues mappings to the Y chromosome (and maybe one more, for comparison, like thyroid tissue).

In [44]:
bbmap_path = os.path.join(paths['bbmap_dir'], 'bbmap.sh')
bbmap_path1 = 'java -cp {}\current align2.BBMap'.format(paths['bbmap_dir'])
out_path = os.path.join(*[paths['split3'], 'bbmappings', 'manual'])
thyroid_path = os.path.join(trim_dirs['3'], 'ERR030903_t.fastq')
testes_path = os.path.join(trim_dirs['3'], 'ERR030902_t.fastq')
ovaries_path = os.path.join(trim_dirs['3'], 'ERR030901_t.fastq')
ref_path = os.path.join(paths['grch38_dir'], 'hs.fasta')
ref_path1 = os.path.join(paths['grch38_dir'], 'NC_000024.10.fasta')
cmd1 = '{0} -Xmx4G in={1} out={2}.sam ref={3} maxindel=200k ambig=random intronlen=20 xstag=us'.format(bbmap_path1, thyroid_path, out_path, ref_path1)
cmd2 = '{0} -Xmx4G in={1} out={2}.sam ref={3} maxindel=200k ambig=random intronlen=20 xstag=us'.format(bbmap_path1, testes_path, out_path, ref_path1)
cmd3 = '{0} -Xmx4G in={1} out={2}.sam ref={3} maxindel=200k ambig=random intronlen=20 xstag=us'.format(bbmap_path1, ovaries_path, out_path, ref_path1)
print('{}\n{}\n{}'.format(cmd1, cmd2, cmd3))
# cmd1 = '{0} -Xmx4G ref={1}'.format(bbmap_path1, ref_path1)
# print(cmd1)
java -cp C:\Users\DMacKellar\Documents\Python\BioPython\BBMap\current align2.BBMap -Xmx4G in=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\trim_2018-07-08T18_48_43\ERR030903_t.fastq out=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\bbmappings\manual.sam ref=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\grch38\NC_000024.10.fasta maxindel=200k ambig=random intronlen=20 xstag=us
java -cp C:\Users\DMacKellar\Documents\Python\BioPython\BBMap\current align2.BBMap -Xmx4G in=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\trim_2018-07-08T18_48_43\ERR030902_t.fastq out=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\bbmappings\manual.sam ref=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\grch38\NC_000024.10.fasta maxindel=200k ambig=random intronlen=20 xstag=us
java -cp C:\Users\DMacKellar\Documents\Python\BioPython\BBMap\current align2.BBMap -Xmx4G in=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\trim_2018-07-08T18_48_43\ERR030901_t.fastq out=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\bbmappings\manual.sam ref=C:\Users\DMacKellar\Documents\Data\Bio\Bmap\grch38\NC_000024.10.fasta maxindel=200k ambig=random intronlen=20 xstag=us

For Thyroid, that gives:

` Read 1 data: pct reads num reads pct bases num bases

mapped: 7.1513% 362 5.2927% 12225 `

For Testes, it's:

` Read 1 data: pct reads num reads pct bases num bases

mapped: 5.4195% 290 4.7333% 12620 `

And for Ovaries:

` Read 1 data: pct reads num reads pct bases num bases

mapped: 8.0251% 384 6.1409% 13495 `

Well, obviously that's surprising, and potentially problematic. I'll have to check some references and see if there's any reason to expect that ovaries tissue RNA might map to the Y-chromosome.

Actually, another problem occurs with the idea of mapping to different chromosomes individually: reads might align to multiple sites, and it would be hard to catch. If this happens when using the entire genome as the index, then at least you'll be aware of which reads do so, and rank the different sites as better or worse matches. If you map the reads to individual chromosomes serially, then it might be hard to track which ones have already mapped elsewhere with a better score.

So probably the best approach to do here would be just to choose a single chromosome, map all the reads to that locally (to do some proof-of-principle in this notebook), then do a more comprehensive run on Galaxy, or some other server instance where I'll have more resources. Let's just do Chr1.

...Actually, even Chr1 runs out of memory with -Xmx7G. Let's try Chr20.

In [45]:
grch38
Out[45]:
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
0 1 assembled-molecule 1 Chromosome CM000663.2 = NC_000001.11 Primary Assembly 248956422 chr1
1 2 assembled-molecule 2 Chromosome CM000664.2 = NC_000002.12 Primary Assembly 242193529 chr2
2 3 assembled-molecule 3 Chromosome CM000665.2 = NC_000003.12 Primary Assembly 198295559 chr3
3 4 assembled-molecule 4 Chromosome CM000666.2 = NC_000004.12 Primary Assembly 190214555 chr4
4 5 assembled-molecule 5 Chromosome CM000667.2 = NC_000005.10 Primary Assembly 181538259 chr5
5 6 assembled-molecule 6 Chromosome CM000668.2 = NC_000006.12 Primary Assembly 170805979 chr6
6 7 assembled-molecule 7 Chromosome CM000669.2 = NC_000007.14 Primary Assembly 159345973 chr7
7 8 assembled-molecule 8 Chromosome CM000670.2 = NC_000008.11 Primary Assembly 145138636 chr8
8 9 assembled-molecule 9 Chromosome CM000671.2 = NC_000009.12 Primary Assembly 138394717 chr9
9 10 assembled-molecule 10 Chromosome CM000672.2 = NC_000010.11 Primary Assembly 133797422 chr10
10 11 assembled-molecule 11 Chromosome CM000673.2 = NC_000011.10 Primary Assembly 135086622 chr11
11 12 assembled-molecule 12 Chromosome CM000674.2 = NC_000012.12 Primary Assembly 133275309 chr12
12 13 assembled-molecule 13 Chromosome CM000675.2 = NC_000013.11 Primary Assembly 114364328 chr13
13 14 assembled-molecule 14 Chromosome CM000676.2 = NC_000014.9 Primary Assembly 107043718 chr14
14 15 assembled-molecule 15 Chromosome CM000677.2 = NC_000015.10 Primary Assembly 101991189 chr15
15 16 assembled-molecule 16 Chromosome CM000678.2 = NC_000016.10 Primary Assembly 90338345 chr16
16 17 assembled-molecule 17 Chromosome CM000679.2 = NC_000017.11 Primary Assembly 83257441 chr17
17 18 assembled-molecule 18 Chromosome CM000680.2 = NC_000018.10 Primary Assembly 80373285 chr18
18 19 assembled-molecule 19 Chromosome CM000681.2 = NC_000019.10 Primary Assembly 58617616 chr19
19 20 assembled-molecule 20 Chromosome CM000682.2 = NC_000020.11 Primary Assembly 64444167 chr20
20 21 assembled-molecule 21 Chromosome CM000683.2 = NC_000021.9 Primary Assembly 46709983 chr21
21 22 assembled-molecule 22 Chromosome CM000684.2 = NC_000022.11 Primary Assembly 50818468 chr22
22 X assembled-molecule X Chromosome CM000685.2 = NC_000023.11 Primary Assembly 156040895 chrX
23 Y assembled-molecule Y Chromosome CM000686.2 = NC_000024.10 Primary Assembly 57227415 chrY
24 HSCHR1_CTG1_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270706.1 = NT_187361.1 Primary Assembly 175055 chr1_KI270706v1_random
25 HSCHR1_CTG2_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270707.1 = NT_187362.1 Primary Assembly 32032 chr1_KI270707v1_random
26 HSCHR1_CTG3_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270708.1 = NT_187363.1 Primary Assembly 127682 chr1_KI270708v1_random
27 HSCHR1_CTG4_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270709.1 = NT_187364.1 Primary Assembly 66860 chr1_KI270709v1_random
28 HSCHR1_CTG5_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270710.1 = NT_187365.1 Primary Assembly 40176 chr1_KI270710v1_random
29 HSCHR1_CTG6_UNLOCALIZED unlocalized-scaffold 1 Chromosome KI270711.1 = NT_187366.1 Primary Assembly 42210 chr1_KI270711v1_random
... ... ... ... ... ... ... ... ... ... ...
565 HSCHR19LRC_PGF2_CTG3_1 alt-scaffold 19 Chromosome GL949753.2 = NW_003571061.2 ALT_REF_LOCI_8 796479 chr19_GL949753v2_alt
566 HSCHR19_4_CTG3_1 alt-scaffold 19 Chromosome KI270938.1 = NT_187693.1 ALT_REF_LOCI_9 1066800 chr19_KI270938v1_alt
567 HSCHR19KIR_FH15_B_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270882.1 = NT_187636.1 ALT_REF_LOCI_10 248807 chr19_KI270882v1_alt
568 HSCHR19KIR_G085_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270883.1 = NT_187637.1 ALT_REF_LOCI_11 170399 chr19_KI270883v1_alt
569 HSCHR19KIR_G085_BA1_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270884.1 = NT_187638.1 ALT_REF_LOCI_12 157053 chr19_KI270884v1_alt
570 HSCHR19KIR_G248_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270885.1 = NT_187639.1 ALT_REF_LOCI_13 171027 chr19_KI270885v1_alt
571 HSCHR19KIR_G248_BA2_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270886.1 = NT_187640.1 ALT_REF_LOCI_14 204239 chr19_KI270886v1_alt
572 HSCHR19KIR_GRC212_AB_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270887.1 = NT_187641.1 ALT_REF_LOCI_15 209512 chr19_KI270887v1_alt
573 HSCHR19KIR_GRC212_BA1_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270888.1 = NT_187642.1 ALT_REF_LOCI_16 155532 chr19_KI270888v1_alt
574 HSCHR19KIR_LUCE_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270889.1 = NT_187643.1 ALT_REF_LOCI_17 170698 chr19_KI270889v1_alt
575 HSCHR19KIR_LUCE_BDEL_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270890.1 = NT_187644.1 ALT_REF_LOCI_18 184499 chr19_KI270890v1_alt
576 HSCHR19KIR_RSH_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270891.1 = NT_187645.1 ALT_REF_LOCI_19 170680 chr19_KI270891v1_alt
577 HSCHR19KIR_RSH_BA2_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270914.1 = NT_187668.1 ALT_REF_LOCI_20 205194 chr19_KI270914v1_alt
578 HSCHR19KIR_T7526_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270915.1 = NT_187669.1 ALT_REF_LOCI_21 170665 chr19_KI270915v1_alt
579 HSCHR19KIR_T7526_BDEL_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270916.1 = NT_187670.1 ALT_REF_LOCI_22 184516 chr19_KI270916v1_alt
580 HSCHR19KIR_ABC08_A1_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270917.1 = NT_187671.1 ALT_REF_LOCI_23 190932 chr19_KI270917v1_alt
581 HSCHR19KIR_ABC08_AB_HAP_C_P_CTG3_1 alt-scaffold 19 Chromosome KI270918.1 = NT_187672.1 ALT_REF_LOCI_24 123111 chr19_KI270918v1_alt
582 HSCHR19KIR_ABC08_AB_HAP_T_P_CTG3_1 alt-scaffold 19 Chromosome KI270919.1 = NT_187673.1 ALT_REF_LOCI_25 170701 chr19_KI270919v1_alt
583 HSCHR19KIR_FH05_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270920.1 = NT_187674.1 ALT_REF_LOCI_26 198005 chr19_KI270920v1_alt
584 HSCHR19KIR_FH05_B_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270921.1 = NT_187675.1 ALT_REF_LOCI_27 282224 chr19_KI270921v1_alt
585 HSCHR19KIR_FH06_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270922.1 = NT_187676.1 ALT_REF_LOCI_28 187935 chr19_KI270922v1_alt
586 HSCHR19KIR_FH06_BA1_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270923.1 = NT_187677.1 ALT_REF_LOCI_29 189352 chr19_KI270923v1_alt
587 HSCHR19KIR_FH08_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270929.1 = NT_187683.1 ALT_REF_LOCI_30 186203 chr19_KI270929v1_alt
588 HSCHR19KIR_FH08_BAX_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270930.1 = NT_187684.1 ALT_REF_LOCI_31 200773 chr19_KI270930v1_alt
589 HSCHR19KIR_FH13_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270931.1 = NT_187685.1 ALT_REF_LOCI_32 170148 chr19_KI270931v1_alt
590 HSCHR19KIR_FH13_BA2_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270932.1 = NT_187686.1 ALT_REF_LOCI_33 215732 chr19_KI270932v1_alt
591 HSCHR19KIR_FH15_A_HAP_CTG3_1 alt-scaffold 19 Chromosome KI270933.1 = NT_187687.1 ALT_REF_LOCI_34 170537 chr19_KI270933v1_alt
592 HSCHR19KIR_RP5_B_HAP_CTG3_1 alt-scaffold 19 Chromosome GL000209.2 = NT_113949.2 ALT_REF_LOCI_35 177381 chr19_GL000209v2_alt
593 MT assembled-molecule MT Mitochondrion J01415.2 = NC_012920.1 non-nuclear 16569 chrM
594 HSCHRUN_RANDOM_CTG29 unplaced-scaffold na na KI270752.1 <> na Primary Assembly 27745 chrUn_KI270752v1

595 rows × 10 columns

In [46]:
def run_bbmap(reads_dir, ref_file_path, map_dirs=None, use_defaults=True, get_stats=True, args_list=None, kwargs_dict=None):
    reads_files = get_fastqs(reads_dir)
    for files in reads_files:
        file_base = files[0].split(sep='_')[0]
        if len(files) == 2:
            file_str = 'in1={} in2={} rcs=f'.format(files[0], files[1])
        elif len(files) == 1:
            file_str = 'in={}'.format(files[0])
            
    if map_dirs==None:
        map_dirs = {}
        new_dir = '1'
    else:
        new_dir = max([int(x) for x in trim_dirs.keys()]) + 1
    try:
        out_dir
    except NameError:
        ts = datetime.datetime.now().replace(microsecond=0).timestamp()
        read_ts = datetime.datetime.fromtimestamp(ts).isoformat().replace(':', '_')
        out_dir = os.path.join(os.getcwd(), 'map_{}'.format(read_ts))
        os.mkdir(out_dir)
    else:
        if not os.path.isdir(out_dir):
            os.mkdir(out_dir)
    map_dirs[str(new_dir)] = str(out_dir)
    os.chdir(out_dir)
            
    if o_s == 'Windows':
        bbmap = 'java -cp {}\current align2.BBMap'.format(paths['bbmap_dir'])
    elif o_s == 'Darwin':
        bbmap = os.path.join(paths['bbmap_dir'], 'bbmap.sh')
    
    def return_defaults():
        default_kwargs = {'maxindel': '200k', 'ambig': 'random', 'intronlen': '20', 'xstag': 'us'}
        default_args = ['-Xmx7G']
        return default_args, default_kwargs

    reads = get_fastqs(reads_dir)
    
    ref_base = os.path.splitext(os.path.basename(ref_file_path))[0].replace('.', '_')
    print('Aligning to {}'.format(ref_base))
    for read in reads:
        if use_defaults==True:
            args, kwargs = return_defaults()
        else:
            args, kwargs = [], {}

        if args_list:
            # imperfect, but try to replace any arg matching first 3 chars
            for a1 in args_list:
                for a2 in args:
                    if a1[:3] == a2[:3]:
                        args.remove(a2)
                        args.append(a1)
        if kwargs_dict:
            for k, v in kwargs_dict.items():
                kwargs[k] = v
        args.append('ref={}'.format(ref_file_path))
        if len(read) == 2:
            base_name = [os.path.splitext(os.path.basename(x))[0].rstrip('_t') for x in read]
            files = 'in1={0} in2={1} out={2}_{3}_{4}.sam'.format(read[0], read[1], ref_base, base_name[0], base_name[1])
        elif len(read) == 1:
            base_name = [os.path.splitext(os.path.basename(*read))[0].rstrip('_t')]
            files = 'in={0} out={1}_{2}.sam'.format(read[0], ref_base, base_name[0])
        else:
            print('I can\'t understand the input fastq_list; did you pass a nested list?')
        if get_stats==True:
            for stat in ['covstats', 'covhist', 'bincov']:
                kwargs['{}'.format(stat)] = '{}_{}_{}.txt'.format(stat, ref_base, base_name[0])
            for k, v in kwargs.items():
                args.append('{}={}'.format(k, v))
        cmds = [bbmap, files, *args]
        final_cmds = ' '.join(cmd for cmd in cmds)
    
        p = subprocess.Popen(final_cmds, stdout=subprocess.PIPE, 
                             stderr=subprocess.PIPE, shell=True)
        out, err = p.communicate()
        if p.returncode == 0:
            print('{} mapped'.format(read))
        else:
            print('{} file returned error: {}'.format(read, err.decode('ascii')))
            break

    return map_dirs
In [47]:
%%time

os.chdir(paths['split3'])
map_dirs = run_bbmap(latest_dir, paths['NC_000020_11_fasta'])
Aligning to NC_000020_11
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030856_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030857_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030858_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030859_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030860_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030861_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030862_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030863_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030864_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030865_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030866_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030867_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030868_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030869_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030870_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030871_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030888_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030889_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030890_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030891_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030892_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030893_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030894_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030895_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030896_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030897_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030898_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030899_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030900_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030901_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030902_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030903_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030872_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030872_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030873_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030873_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030874_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030874_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030875_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030875_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030876_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030876_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030877_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030877_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030878_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030878_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030879_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030879_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030880_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030880_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030881_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030881_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030882_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030882_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030883_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030883_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030884_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030884_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030885_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030885_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030886_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030886_2_t.fastq'] mapped
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030887_1_t.fastq', 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\ERR030887_2_t.fastq'] mapped
Wall time: 16min 53s

Ok, that takes a while to run, but does complete successfully. I'll try MultiQC on the resulting dir to see whether it can wrap any of the stats files output by bbmap.


Parsing BBMap Output

In [48]:
def run_multiqc(in_dir):
    if o_s == 'Darwin':
        multiqc = 'multiqc .'
    if o_s == 'Windows':
        multiqc = 'python {} .'.format(os.path.join(*[paths['multiqc_dir'], 'scripts', 'multiqc']))
    os.chdir(in_dir)
    p = subprocess.Popen(multiqc, stdout=subprocess.PIPE, 
                         stderr=subprocess.PIPE, shell=True)
    out, err = p.communicate()
    if p.returncode != 0:
        print('\n returned an error: \n{}\n'.format(
            err.decode('ascii')))
    else:
        print('MultiQC run')
        report = os.path.join(in_dir, 'multiqc_report.html')
        webbrowser.open(report)
In [49]:
paths['map_dir1'] = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\map_2018-07-02T20_43_53'

run_multiqc(paths['map_dir1'])
MultiQC run

Ok, MultiQC does recognize data in that dir, but the only report it summarizes is the coverage histogram. That's a relatively succinct output; for covstats_NC_000020_11_ERR030856.txt it reads:

In [50]:
paths['example_covstats'] = os.path.join(
    paths['map_dir1'], 'covstats_NC_000020_11_ERR030856.txt'
)

with open(paths['example_covstats'], 'r') as f:
    ex_covstats_df = pd.read_csv(f, sep='\t')
    
ex_covstats_table = sra_table[sra_table['Run'] == 'ERR030856']
ex_covstats_tiss = ex_covstats_table['organism_part']
ex_covstats_len = ex_covstats_table['AvgSpotLen']
print('ERR030856 tissue:\n{}\n\nERR030856 read length:\n{}'.format(
    ex_covstats_tiss, ex_covstats_len)
     )
ex_covstats_df
ERR030856 tissue:
0    16 Tissues mixture
Name: organism_part, dtype: object

ERR030856 read length:
0    100
Name: AvgSpotLen, dtype: int64
Out[50]:
#ID Avg_fold Length Ref_GC Covered_percent Covered_bases Plus_reads Minus_reads Read_GC Median_fold Std_Dev
0 NC_000020.11 Homo sapiens chromosome 20, GRCh3... 0.002 64444167 0.438 0.1904 122717 235 316 0.4428 0 0.05

In other words, for the 100bp reads run made with a mix of RNA from 16 tissues, using just the first 10,000 reads (trimmed down to ~5,000 reads), about 551 reads mapped to the 20th chromosome of the grch38 build of the H. sapiens genome, covering about 122,717bp of the 64,444,167bp in that chromosome, or about 0.19%.

That's the covhist output. BBMap can also output a bincov stats file, which (according to this wiki) instead breaks down the reference file into chunks of different bp-long locations along the chromosome, and tells how many reads mapped to that region, specifically.

In [51]:
paths['example_bincov'] = os.path.join(
    paths['map_dir1'], 'bincov_NC_000020_11_ERR030856.txt'
)

with open(paths['example_bincov'], 'r') as f:
    ex_bincov_df = pd.read_csv(f, sep='\t', header=2)

print(ex_bincov_df['Cov'].describe())
print(ex_bincov_df['Cov'].value_counts().sort_index())
ex_bincov_df.head()
count    64445.000000
mean         0.002023
std          0.039496
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.330000
Name: Cov, dtype: float64
0.00    63943
0.01       10
0.02       72
0.03       37
0.04       26
0.05       24
0.06       27
0.07       19
0.08       45
0.09       78
0.10        3
0.11        1
0.12        2
0.13        6
0.14        4
0.15        1
0.16        1
0.17        7
0.18        3
0.19        3
0.20        1
0.21        1
0.22        3
0.23        2
0.25        3
0.26        2
0.27        1
0.28        3
0.29        1
0.31        1
0.32        1
0.33        2
0.34        2
0.35        2
0.39        1
0.40        2
0.44        2
0.45        1
0.49        1
0.54        1
0.61        1
0.62        2
0.63        1
0.65        2
0.71        2
0.74        1
0.76        1
0.79        1
0.82        2
0.84        1
0.87        1
1.00       80
1.04        1
1.15        1
1.22        1
1.29        1
1.33        1
Name: Cov, dtype: int64
Out[51]:
#RefName Cov Pos RunningPos
0 NC_000020.11 Homo sapiens chromosome 20, GRCh3... 0.0 1000 0
1 NC_000020.11 Homo sapiens chromosome 20, GRCh3... 0.0 2000 1000
2 NC_000020.11 Homo sapiens chromosome 20, GRCh3... 0.0 3000 2000
3 NC_000020.11 Homo sapiens chromosome 20, GRCh3... 0.0 4000 3000
4 NC_000020.11 Homo sapiens chromosome 20, GRCh3... 0.0 5000 4000

In other words, this file shows every kb of Chr20, and tells how much coverage it saw (I'm guessing that, since most of the values are fractional, it's not saying whether any reads mapped within that region at all, but, rather, Cov is a measure of how many of the 1,000bp within that region are represented).

Ok, next is covhist, which looks like it shows pretty much the same info, but the other way around: it shows the number of bp in the whole chromosome that are covered 0, 1, 2, etc. times:

In [52]:
paths['example_covhist'] = os.path.join(paths['map_dir1'], 'covhist_NC_000020_11_ERR030857.txt')

with open(paths['example_covhist'], 'r') as f:
    ex_covhist_df = pd.read_csv(f, sep='\t')

plt.bar(ex_covhist_df['#Coverage'], ex_covhist_df['numBases'])
plt.yscale('log')
plt.ylabel('[Log] Number of bases represented')
plt.xlabel('Coverage')
plt.show()
ex_covhist_df
Out[52]:
#Coverage numBases
0 0 64342285
1 1 98783
2 2 1965
3 3 476
4 4 326
5 5 152
6 6 81
7 7 12
8 8 61
9 9 21
10 10 5

There was also an option to output an additional stats file with the arg basecov=basecov.txt, but it was too big to keep (like, multiple gigabytes). I'll try producing a single one here, though, and parsing its content):

In [53]:
os.chdir(trim_dirs['3'])
if not os.path.exists('dummy'):
    os.mkdir('dummy')
os.chdir('dummy')
shutil.copy2(os.path.join(trim_dirs['3'], 'ERR030856_t.fastq'), 'ERR030856_t.fastq')

kwargs_dict = {'basecov': 'basecov.txt'}
run_bbmap(os.getcwd(), paths['NC_000020_11_fasta'], kwargs_dict=kwargs_dict)

with open('basecov.txt', 'r') as f:
    head = [next(f) for x in range(10)]
    
os.chdir(trim_dirs['3'])
with open('example_basecov.txt', 'w') as f:
    for line in head:
        f.write(line)

shutil.rmtree('dummy')
for line in head:
    print(line)
Aligning to NC_000020_11
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\dummy\\ERR030856_t.fastq'] mapped
#RefName	Pos	Coverage

NC_000020.11 Homo sapiens chromosome 20, GRCh38.p12 Primary Assembly	0	0

NC_000020.11 Homo sapiens chromosome 20, GRCh38.p12 Primary Assembly	1	0

NC_000020.11 Homo sapiens chromosome 20, GRCh38.p12 Primary Assembly	2	0

NC_000020.11 Homo sapiens chromosome 20, GRCh38.p12 Primary Assembly	3	0

NC_000020.11 Homo sapiens chromosome 20, GRCh38.p12 Primary Assembly	4	0

NC_000020.11 Homo sapiens chromosome 20, GRCh38.p12 Primary Assembly	5	0

NC_000020.11 Homo sapiens chromosome 20, GRCh38.p12 Primary Assembly	6	0

NC_000020.11 Homo sapiens chromosome 20, GRCh38.p12 Primary Assembly	7	0

NC_000020.11 Homo sapiens chromosome 20, GRCh38.p12 Primary Assembly	8	0

Ok, so that basecov.txt file is 4.79GB in size; much larger than the input reference fasta. And it appears that BBMap may attempt to generate one such file for every alignment made. That would consume way too much space to be worth it. But, to summarize the output, it looks essentially like the name describing the file. Specifically, for every base in the reference chromosome, it lists the coverage from the reads, explicitly, in human-readable format (except of course for the fact that no human could feasibly have the patience to iteratively search through such an enormous number of lines to retrieve the coverage they want). That info should be recoverable through some series of operations from the output .sam file itself, which is way more compact; there's no justification for keeping those stats output files from BBMap.

Finally, there's the sam file itself. In looking for a viewer for that file format, I cam across a thread suggesting the use of software called IGV, from the Broad Institute. Unfortunately, that program requires bam file format for input, which will require me to re-run bbmap:

In [54]:
os.chdir(trim_dirs['3'])
if not os.path.exists('dummy'):
    os.mkdir('dummy')
os.chdir('dummy')
shutil.copy2(os.path.join(trim_dirs['3'], 'ERR030856_t.fastq'), 'ERR030856_t.fastq')

kwargs_dict = {'out': 'ERR030856.bam'}
run_bbmap(os.getcwd(), paths['NC_000020_11_fasta'], kwargs_dict=kwargs_dict)
Aligning to NC_000020_11
['C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\dummy\\ERR030856_t.fastq'] mapped
Out[54]:
{'1': 'C:\\Users\\DMacKellar\\Documents\\Data\\Bio\\Bmap\\split3\\trim_2018-07-08T18_48_43\\dummy\\map_2018-07-08T19_11_38'}

But even that output returns an error when attempting to build an index within the IGV program:

Error: java.lang.IllegalStateException: Records ERR030856.1457 HWI-BRUNOP16X_0001:1:1:13634:1239#0 length=100 (*:0) should come after ERR030856.1458 HWI-BRUNOP16X_0001:1:1:13718:1241#0 length=100 (NC_000020.11:48,839,770) when sorting with htsjdk.samtools.SAMRecordCoordinateComparator

(Sheesh, I had to copy that whole error message manually, since the stupid window that pops up with that doesn't let you select text.) Anyways, it sounds like SAM and BAM outputs have very specific tools for handling them, and specifically that BAM files must be sorted before they can be indexed. I would have assumed that BBMap would handle that automatically, but apparently there can be problems with that process. Specifically, everything relies upon a suite of programs called SAMTools, but apparently it's not built for Windows. To get the functionality to sort BAM files on Windows, I'll try BOW.

Hmm... No, BOW contains BWA, which is the Burrows-Wheeler Aligner, which is an (older) alternative algorithm for mapping reads in the first place; it should output proper BAM files, but that would obviate all the advantages of running BBMap. But I can't find the source code for SamTools within the BOW archive. I searched for SamTools on my PC, and it did find some dirs within the source code for MultiQC, but the scripts within those folders look like they're designed to call some of the functionality of SamTools, but maybe not recapitulate it directly.

On the other hand, maybe it's worth trying to install samtools on windows anyway. I extracted to C:\Users\DMacKellar\Documents\Python\BioPython\samtools-1.8, and will have to try cmake on it.

I used the bash shell to make samtools on the PC, so the executables appear to be unavailable under the native windows environment. It still took several hours to work the kinks out to get it installed. Now I'll try to use it to sort the bam file.


Well, samtools returns utterly unhelpful errors when I try samtools sort NC...:

samtools sort NC_000020_11_ERR030888.sam hbr^?" ݸ՛ ÓAchI!a# R譊)X6Cu -r\hKhmj:y㸍%$fG)$IJ݂uH]iIHD$jM<DT~$a#Ԃef@-:U͕[?zl)캍{Wº#΅T~[[A!ؽ~yj   BC x^ UXuny֖9S*ioIU~t\ 1H:v'|؄񬵥]GS!Z{Ȯɽޫ;Ozh:ZVi=[xٟ5h4?*,f壓tb/xor9oh0BHciwJp>J-oޫ>v{ŧz-ߘ'F[[\N'Gdzxx7 [E::bgzf_flush] File write failed (wrong size) hbr^?" ݸ՛ ÓAchI!a# R譊)X6Cu -r\hKhmj:y㸍%$fG)$IJ݂uH]iIHD$jM<DT~$a#Ԃef@-:U͕[?zl)캍{Wº#΅T~[[A!ؽ~yj   BC x^ UXuny֖9S*ioIU~t\ 1H:v'|؄񬵥]GS!Z{Ȯɽޫ;Ozh:ZVi=[xٟ5h4?*,f壓tb/xor9oh0BHciwJp>J-oޫ>v{ŧz-ߘ'F[[\N'Gdzxx7 [E::bgzf_close] File write failed samtools sort: failed to create "-": Input/output error

So I figured that maybe the sam file was corrupted or improperly formatted or something, but when I open it in Notepad++ on the PC the file appears to fit the specification. So I figured maybe samtools was somehow not installing properly when using bash, so I spent HOURS trying to get it to compile with various C++ compilers on Windows: Visual Studio C++, MinGW, Mingw-64, etc.

Now I'm not entirely certain how compiling on Windows is supposed to go; you can (in the case of Visual Studio C++) do 'cl', which calls the compiler, then just specify a bunch of filenames to compile, i.e., 'cl blah.cpp blah2.cpp', but there are so many files in this package that that approach seems stupid. The more usual way to compile on UNIX is to do:

./configure
make
make-install

But trying cl configure, cl configure.ac, or cl ./configure.ac keeps returning an error that says:

configure.ac: file not recognized: File format not recognized

So, with MinGW, I tried all the various compilers available: c++, gcc, g++, etc. They all say the same damn thing.

Hours later, I found out there's a wrapper for the basic SAM/BAM functionality in Python, called pysam. But apparently it doesn't support Windows, either: attempting pip install pysam yields:

File "C:\Users\DMACKE~2\AppData\Local\Temp\pip-install-la_s7snp\pysam\setup.py", line 69, in run_make_print_config
stdout = subprocess.check_output(["make", "-s", "print-config"])

And, indeed, make isn't a valid command on Windows. I tried changing that line to cmake, c++, etc., but to no avail.

Anyways, eventually I went back to the bash-specific samtools install, and tried changing the commands, according to this workflow. Finally I found out that running:

samtools sort NC...bam -o NC..._1.bam

Will work. Apparently it was just waiting for me to specify an output file with a different name; it won't perform a sort inplace. So really, I've lost a couple of days of productivity because whoever built samtools couldn't be bothered to have it return any kind of inline error handling that gave you a clue as to its preferred input arg formatting. I should have read more documentation/examples before going down this rabbit hole, but given how poorly & minimally supported bioinformatics software is, especially on Windows, I had no real reason to expect that the thing wasn't fundamentally broken.

Anyways, I think I've now more or less figured out a workflow that should work on the PC for getting these stupid files sorted so I can visualize the outputs of the mappings.

In [55]:
bash = 'bash'
samtools = '/mnt/c/Users/DMacKellar/Documents/Python/BioPython/samtools-1.8/samtools'
map_dir = '/mnt/c/Users/DMacKellar/Documents/Data/Bio/Bmap/split3/map2018-07-02T20_43_53'
In [56]:
def sort_sam(input_dir):
    os.chdir(input_dir)
    sam_files = []
    for file in os.listdir(input_dir):
        if os.path.splitext(file)[-1] == '.sam':
            sam_files.append(file)
            
    if o_s == 'Windows':
        samtools = '/mnt/c/Users/DMacKellar/Documents/Python/BioPython/samtools-1.8/samtools'
        sort_cmds = ['bash']
        index_cmds = ['bash']
        for sam in sam_files[::10]:
            # subprocess can't handle arbitrarily long cmd line strings
            fname = os.path.splitext(os.path.basename(sam))[0]
#             print(fname)
            sort_cmds.append('{0} sort {1} -o {2}.bam'.format(samtools, sam, fname))
            index_cmds.append('{0} index {2}.bam'.format(samtools, sam, fname))
        final_sort_cmds = ';'.join(sort_cmds)
        final_index_cmds = ';'.join(index_cmds)
        p1 = subprocess.Popen(final_sort_cmds, stdout=subprocess.PIPE, 
                         stderr=subprocess.PIPE, shell=True)
        out, err = p1.communicate()
        if p1.returncode != 0:
            print('\n returned an error: \n{}\n'.format(
                err.decode('ascii')))
        else:
            print('Sam files sorted.')
        p2 = subprocess.Popen(final_index_cmds, stdout=subprocess.PIPE, 
                         stderr=subprocess.PIPE, shell=True)
        out, err = p2.communicate()
        if p2.returncode != 0:
            print('\n returned an error: \n{}\n'.format(
                err.decode('ascii')))
        else:
            print('Bam files indexed.')
#         report = os.path.join(in_dir, 'multiqc_report.html')
#         webbrowser.open(report)
In [57]:
map_dir1 = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\map_2018-07-02T20_43_53'
sort_sam(map_dir1)
Sam files sorted.
Bam files indexed.

Ok, that's working, but the IGV viewer tool isn't really showing anything for the output, sorted BAM files. Maybe that's because I used so few reads, and they're expecting a lot higher coverage?

Yeah, that's about it. I have to zoom in to like 75% of the way towards the most zoomed in that I can get to see anything. Anyway, the reads are there. That fine-grained a view of the output of the mapping isn't really that suitable for handling so many different reads and templates as this, however. It's time to move on with stats to try to quantify differential gene expression.


Quantifying Gene Expression

Scaling Up with Galaxy

Given that my laptop's RAM doesn't even successfully index the reference sequence of many human chromosomes with BBMap, it's clear that the scale of this analysis exercise exceeds local computational resources. To get access to more memory and processing power to handle more of the BodyMap data, I could of course turn to popular, (inexpensive) fee-based services like AWS or Google Cloud Compute. An alternative, however, that makes significant computational resources available for free that is geared specifically towards bioinformatics data analysis, is Galaxy. Finally, there are some videos available, derived from an ongoing series of conferences focused on Galaxy.

Galaxy is meant as an approachable, GUI-based platform for non-specialists in computation, and it would be somewhat tedious to reproduce many of the steps I've carried out in this notebook with their manual browser interface, but fortunately they also allow programmatic access via an api, and equally happily, someone has developed Python-specific bindings for that API via a package called BioBlend. Their processes aren't unlimited, however, and so they offer another interface called CloudMan as a Galaxy-ecosystem wrapper to handle especially large jobs, which sounds like it's a wrapper that actually outsources the job to AWS.

I'm having difficulty finding hard limits on the resources made available for free in the Galaxy Main instance, and those that would necessitate additional (fee-based) power. I suppose that I can begin by attempting some data transfer to, and running various-sized jobs on, a Galaxy Main instance, and seeing if they provide any error messages or personalized guidance if I exceed my welcome as a free user. This of course requires an API key.

Actually, wait: I finally found an outline of limits they expect you to obey.

In [5]:
galaxy_api = credentials['galaxy']['api_key']
In [7]:
from bioblend import galaxy

gi = galaxy.GalaxyInstance(url='https://usegalaxy.org/', key=galaxy_api)
gi.config.get_version()
Out[7]:
{'extra': {}, 'version_major': '18.05'}

Exploring Galaxy Structure

Now to start organizing the workspace and downloading datasets. Additional examples of the syntax can be found here. There's also a series of short Jupyter Notebooks showing some of its functionality, apparently produced by one of the developers of BioBlend, here.

Note that Galaxy (as run from within the browser) appears to have the option to get data directly from NCBI SRA, but not from NCBI Nucleotide. It does, however, have the ability to access data directly from the UCSC genome browser, which does have access to the Human Genome GRCh38 build, which I've been using so far in this notebook, so the outputs should be comparable. Unfortunately, figuring out which tracks contain the raw nucleotide sequence of the genome is proving difficult; I'll have to find another source for the reference to which to align the reads.

Furthermore, it looks like the default organization of data within the BioBlend API is to generate/return Libraries of data, which each contain Folders and can have different permissions set.

The organization of the Galaxy Instance returned by BioBlend is a little complex. The Instance returned doesn't contain all of the data available, just the necessary credentials to access it by querying the (distant) Galaxy server. So subsequent querying of the instance and its available data can take significant amounts of time. The default Galaxy Instance returned from BioBlend has several nested classes: folders, genomes, histories, jobs, libraries, workflows. It has a quotas file, too, but that appears not to show you the resources associated with the instance unless you have admin privileges. The documentation on these files/classes isn't great, so I've mostly found few things out by just iteratively running the built-in Python dir() function on each of them (i.e., 'dir(gi.libraries)').

It looks like most of the built-in data are contained in libraries and genomes, which are independent. By scanning through a list of the libraries, you can find out that the Illumina BodyMap data are already uploaded to & accessible to a Galaxy Instance by default:

In [8]:
libs = gi.libraries.get_libraries()
for i, lib in enumerate(libs):
    if 'Illumina' in lib['name']:
        print('{:>2} {} {:<30} {}'.format(i, lib['id'], lib['name'], lib['description']))
21 d0c8e88ab05c469f Illumina iDEA Datasets (sub-sampled) Sub-samapled versions of datasets used for the Illumina iDEA challenge
25 8bb3ab7690e13de8 Illumina BodyMap 2.0           RNA-seq data for the Illumina BodyMap 2.0 project
In [9]:
gi_bmap_lib = libs[25]['id']

Getting Dataset IDs

Within each library are one or more folders; you can retrieve the ids and names for each folder within the gi.libraries class, using the gi.libraries.get_folders() method, but you can't actually see the data within that folder. You can see individual folders by taking one of the ids output by the gi.libraries.get_folders() method and feeding it to the gi.libraries.show_folder() method, but that won't return any more info about it, just the same exact dict output by the get_folders method, but in isolation. To get more info, you have to back out of the gi.libraries subclass and go into gi.folders subclass. If you call the gi.folders.show_folder() method, you can pass it an individual folder id, and it will default to showing you the same data as the gi.libraries.show_folder() method, unless you pass this former method the additional arg contents=True.

This will output a dict. One key, value pair of that dict will be the key 'metadata', whose value is another dict. The other key, value pair output by calling gi.folders.show_folder(folder_id, contents=True) will be the key 'folder_contents', whose value is a list of dicts. But things get trickier. For each dict in the list of dicts corresponding to the key 'folder_contents', the dict will have a key called 'type', and that type can be either 'folder' or 'file'. If the type is 'folder', it will also list the separate ID of that folder. If you call gi.folders.show_folder() on these nested folders, you can see whether they have yet more folders inside of them, or just files.

In [13]:
gi_bmap_folder = gi.libraries.get_folders(gi_bmap_lib)
gi_bmap_folder
Out[13]:
[{'url': '/api/libraries/8bb3ab7690e13de8/contents/Fd795f6d3e169879a',
  'type': 'folder',
  'name': '/',
  'id': 'Fd795f6d3e169879a'}]
In [17]:
folders = gi.folders.show_folder(gi_bmap_folder[0]['id'], contents=True)
folders
Out[17]:
{'folder_contents': [{'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030872_1_thyroid.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'f6a3bd151a3a8748',
   'date_uploaded': '2012-01-16T21:47:58.770405'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030872_2_thyroid.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'ae0b7240c36ccba5',
   'date_uploaded': '2012-01-16T21:47:56.969001'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030873_1_testes.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '7d56c4f017dcd757',
   'date_uploaded': '2012-01-16T21:47:51.490993'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030873_2_testes.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'fde7e9ca25602ee1',
   'date_uploaded': '2012-01-16T21:47:46.860265'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030874_1_ovary.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '839fe5d1e0e9c165',
   'date_uploaded': '2012-01-16T21:47:52.034962'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030874_2_ovary.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '519fd9d1242cc1e0',
   'date_uploaded': '2012-01-16T21:47:49.572931'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030875_1_white_blood_cells.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.4 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '3dcae1a1c9673b8f',
   'date_uploaded': '2012-01-16T21:47:49.834471'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030875_2_white_blood_cells.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.4 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4547b18930bcca99',
   'date_uploaded': '2012-01-16T21:47:48.819835'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030876_1_skeletal_muscle.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '399161604c092ae0',
   'date_uploaded': '2012-01-16T21:47:59.071412'},
  {'update_time': '2012-01-16 09:48 PM',
   'name': 'ERR030876_2_skeletal_muscle.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:48 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '3980dbe474dddda2',
   'date_uploaded': '2012-01-16T21:48:00.271105'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030877_1_prostate.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'b76c0dda2ec23632',
   'date_uploaded': '2012-01-16T21:47:57.250436'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030877_2_prostate.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '31732550962f350e',
   'date_uploaded': '2012-01-16T21:47:46.295875'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030878_1_lymph_node.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'a2eaae02cff788e2',
   'date_uploaded': '2012-01-16T21:47:52.583248'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030878_2_lymph_node.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'd926c582e8b7b2c6',
   'date_uploaded': '2012-01-16T21:47:57.849570'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030879_1_lung.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.1 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'bd99481f96cf6512',
   'date_uploaded': '2012-01-16T21:47:54.553624'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030879_2_lung.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.1 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '3200103516b45c0f',
   'date_uploaded': '2012-01-16T21:47:49.315662'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030880_1_adipose.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.8 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'dadcba58fd7e1d61',
   'date_uploaded': '2012-01-16T21:47:55.418221'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030880_2_adipose.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.8 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '1f6b28ee654eb129',
   'date_uploaded': '2012-01-16T21:47:51.748967'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030881_1_adrenal.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '24a5812c11b4c1f0',
   'date_uploaded': '2012-01-16T21:47:55.135371'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030881_2_adrenal.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'a1abd22dea5eaaef',
   'date_uploaded': '2012-01-16T21:47:52.298635'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030882_1_brain.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c101d05bd0f525d1',
   'date_uploaded': '2012-01-16T21:47:58.160521'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030882_2_brain.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '5a32854d8339314a',
   'date_uploaded': '2012-01-16T21:47:50.099875'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030883_1_breast.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4665dd4441aa475b',
   'date_uploaded': '2012-01-16T21:47:54.276754'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030883_2_breast.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'd039e2757ad0af3c',
   'date_uploaded': '2012-01-16T21:47:57.552267'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030884_1_colon.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '380b536cca82c11d',
   'date_uploaded': '2012-01-16T21:47:52.866195'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030884_2_colon.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c5ef7fa2585b8706',
   'date_uploaded': '2012-01-16T21:47:49.064696'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030885_1_kidney.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '96fbf82c4429bc02',
   'date_uploaded': '2012-01-16T21:47:56.003574'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030885_2_kidney.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '7a94aebc4fa0dcdc',
   'date_uploaded': '2012-01-16T21:47:54.830905'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030886_1_heart.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '40d2284efbf7ef0a',
   'date_uploaded': '2012-01-16T21:47:53.963799'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030886_2_heart.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c561ecaddc7fc85d',
   'date_uploaded': '2012-01-16T21:47:47.123906'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030887_1_liver.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '8f26a0dbd6e9333d',
   'date_uploaded': '2012-01-16T21:47:51.221231'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030887_2_liver.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '31dea5361ec9421c',
   'date_uploaded': '2012-01-16T21:47:55.710015'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030888_adipose.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'bfcfb3010da4b6a8',
   'date_uploaded': '2012-01-16T21:47:50.653450'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030889_adrenal.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '15ec091f521ad959',
   'date_uploaded': '2012-01-16T21:47:53.144164'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030890_brain.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.8 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '96bbf3272771839b',
   'date_uploaded': '2012-01-16T21:47:50.361833'},
  {'update_time': '2012-01-16 09:48 PM',
   'name': 'ERR030891_breast.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:48 PM',
   'is_private': False,
   'file_size': '15.4 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c857a19d677604e6',
   'date_uploaded': '2012-01-16T21:48:00.570425'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030892_colon.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.0 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'dcf0e672e1333e91',
   'date_uploaded': '2012-01-16T21:47:46.598656'},
  {'update_time': '2012-01-16 09:48 PM',
   'name': 'ERR030893_kidney.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.9 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '46df3e684412a294',
   'date_uploaded': '2012-01-16T21:47:59.683890'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030894_heart.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4fa309035f471214',
   'date_uploaded': '2012-01-16T21:47:58.455682'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030895_liver.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.4 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4e4d02922fcb3933',
   'date_uploaded': '2012-01-16T21:47:56.670498'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030896_lung.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4df6282897c485c7',
   'date_uploaded': '2012-01-16T21:47:47.389174'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030897_lymph_node.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '432354bd93c21d5f',
   'date_uploaded': '2012-01-16T21:47:48.322755'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030898_prostate.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c7595b938cd160aa',
   'date_uploaded': '2012-01-16T21:47:59.391636'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030899_skeletal_muscle.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'fecae70edc3b1987',
   'date_uploaded': '2012-01-16T21:47:48.571176'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030900_white_blood_cells.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '767530b6e7057080',
   'date_uploaded': '2012-01-16T21:47:53.424273'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030901_ovary.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.1 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'f348cffd461a84b2',
   'date_uploaded': '2012-01-16T21:47:56.290547'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030902_testes.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '8f5304a8adf17a54',
   'date_uploaded': '2012-01-16T21:47:48.029373'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030903_thyroid.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.0 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '1f711917a9c0c715',
   'date_uploaded': '2012-01-16T21:47:50.945340'}],
 'metadata': {'parent_library_id': '8bb3ab7690e13de8',
  'can_modify_folder': False,
  'folder_description': '',
  'can_add_library_item': False,
  'full_path': [['Fd795f6d3e169879a', 'Illumina BodyMap 2.0']],
  'folder_name': 'Illumina BodyMap 2.0'}}
In [18]:
gi_bmap_files = {}
for folder in folders['folder_contents']:
    gi_bmap_files[folder['name']] = folder['id']
    
print(len(gi_bmap_files))
48

If you get down deep enough in this nested folder hierarchy to find files, you can use the gi.folders.show_folder() method to retrieve specific IDs associated with each individual file. That specific ID is the proper string to identify a 'dataset', even though there's nothing in the formatting of the BioBlend code or docs to really suggest that the two overlap. You can then get metadata about the file by calling gi.datasets.show_dataset() if you pass as args the file id output by the show_folder() call, and pass the arg hda_ldda='ldda'. You can also presumably then access the data locally by passing gi.datasets.download_dataset() by passing that same file id.

Ok, so now I know that the native data for the Illumina BodyMap project are available to Galaxy; now I need to know how to use them. It appears that nothing gets done computationally until you set up a workflow of steps to be done, and then pass gi.workflows.invoke_workflow() to the galaxy instance. What's unclear to me is whether this defaults to running locally (i.e., on my PC, which won't accomplish anything that I couldn't do on my own and with way fewer complications), or whether it schedules it for running on the Galaxy Main server. It would be a good idea to try out a sample task, and submit it.

Ok, more progress: I figured out that using Galaxy is very reliant on the "History" concept: it was largely intended to help standardize and make more reproducible NGS workflows, after all. So you want to go into the current Galaxy Instance's histories class, and return the custom ID that they've assigned for keeping track of your session using the gi.histories.get_current_history() method, then pass that, along with the specific file ID you determined using the libraries and folders classes above, and it'll copy those data (or perhaps an indexer to those data as instantiated on the main Galaxy server) to your Galaxy account; it shows up in the History tab of the GUI in the browser.

Transferring Bmap Datasets

With the history_id and bodymap dataset_ids in hand, you can copy the dataset to your history with the command gi.histories.upload_dataset_from_library(), passing in the two ids.

Now, given that the quota guidelines for individual users' histories on Galaxy Main cap storage space at 200G, and given the large size of the Illumina BodyMap datasets:

In [20]:
sizes = []
for folder in folders['folder_contents']:
    sizes.append(float(folder['file_size'].rstrip(' GB')))
print(sum(sizes))
641.5000000000001

We won't be able to process all of the runs in a single history. In fact, given that qc and trimming of each dataset is likely to increase the space required for it by at least 2 or 3 fold, it seems prudent to only process a handful of runs per history. For now, I'll proceed with completing the pipeline of a single experiment as a trial, and, once a reasonable workflow has been established, scale up.

In [32]:
# # DCM Note: comment out after running once
# gi_hist = gi.histories.get_current_history()['id']
# for k, v in gi_bmap_files.items():
#     if 'thyroid' in k:
#         gi.histories.upload_dataset_from_library(gi_hist, v)

Once you have sent any dataset from a library to your active history, you can see all of the datasets in your history by passing gi.histories.show_history(gi_hist); this will return a list of arbitrary dataset ids, as well as their status ('ok', 'running', etc.), but they won't show additional info to make sense of which dataset has which contents. To get that, pass the additional arg contents=True to the command above, and it'll spit out a list of dicts with dataset names, as well; if you figure out the right identifying fields to pass you can parse this and save IDs as a Python variable that'll ease future access.

In [35]:
gi_hist_datasets = {}
for ds in gi.histories.show_history(gi_hist, contents=True):
    if ds['history_content_type'] == 'dataset' and ds['deleted'] == False:
        gi_hist_datasets[ds['name']] = ds['id']
        
for k, v in gi_hist_datasets.items():
    print('{}: {}'.format(v, k))
bbd44e69cb8906b5a230f5586070575b: UCSC Main on Human: ncbiRefSeq (genome)
bbd44e69cb8906b542d1dd917831d37f: rna.fa.gz
bbd44e69cb8906b55102a7e4360f1415: ERR030872_1_thyroid.fastq
bbd44e69cb8906b5a13249c275239ceb: ERR030872_2_thyroid.fastq
bbd44e69cb8906b546ac835f8c46cf72: ERR030903_thyroid.fastq
In [43]:
sizes = []
for k, ds in gi_hist_datasets.items():
    if 'thyroid' in k:
        d = gi.histories.show_dataset(gi_hist, ds)
        size = float(d['misc_blurb'].rstrip(' Gb'))
        sizes.append(size)
print(sizes)
[12.5, 12.5, 16.0]

Transferring Genome/Transcriptome Data

Unfortunately, I can't find a way to access anything like a unique dataset ID for the GRCh38 genome that's present under the gi.genomes class to send it to the history as well. And UCSC Browser-based copy of that genome that I tried to copy over into Galaxy manually appears to be more of a gene list than the raw genome; the tracks feature within that tool for selecting what info to retain when downloading really didn't make it clear what someone should choose to get just the raw nucleotide sequence. I'll try instead sending from NCBI, using gi.genomes.install_genome().

Nope; I still couldn't find a unique NCBI accession number to have Galaxy download it directly. The in-line help for gi.genomes.install_genome() says that the 'ncbi_name' arg expects 'NCBI's genome identifier', suggesting that the accession number should come specifically from the 'Genome' database within the NCBI site, and I can't find any specific identifier that would accord with GRCh38 in that repository. Furthermore, I expect that, for this project, I'll want to upload not just the raw genome assembly sequence, but will in fact want to align against the predicted human transcriptome. That can be difficult if your organism is obscure; you may have to actually build the putative transcriptome yourself from a genome assembly and lots of RNAseq reads, prior to performing gene quantification with the RNAseq data. But in this case, we're dealing with a well-covered and important organism, whose transcriptome has been derived from many data sources ans built by very well-informed researchers; I doubt that I'm going to learn anything new that they haven't. So in this case I'll just want to download the relevant RNA transcriptome predicted from the GRCh38.p12 assembly.

I found a unique page on NCBI corresponding to that patch, and following links to the annotation report, you can find a link to the ftp site. Reading through the README file in the base dir suggests that the relevant data that I'm looking for are in the /RNA subdir, and by downloading both and checking, I find that the specific file of most utility is probably not the 125MB 'Gnomon_mRNA.fsa.gz', but rather the 70MB rna.fa.gz file.

A note about the transcriptome: the downloaded FASTA file has 159,998 instances of the '>' character, which should indicate that many transcripts are present in the build. This is very close to the sum (160,474) of the mRNAs (113,620) and non-coding transcripts (46,854) listed in the NCBI annotation report for GRCh38.p12. That same report breaks down the transcriptome's content as possessing 54,644 'genes and pseudogenes', including 20,203 protein-coding genes, and 17,871 non-coding gene, and 20,110 genes with variants, meaning nearly every protein-coding gene comes with splice variants.

Unfortunately, it doesn't seem like there's a way to transfer this directly from NCBI to Galaxy, probably because the only method that might suffice, gi.tools.upload_from_ftp(), doesn't really provide documentation on how to pass in credentials to satisfy NCBI. Trying the method with the proper FTP URL and history ID, with no additional params, returns: Unexpected HTTP status code: 400.

Instead, it seems to be prudent to download the transcriptome file to my local machine, then upload it to Galaxy with gi.tools.upload_from_file():

In [164]:
#!/usr/bin/python
# DCM Note: code modified from http://rizwanansari.net/download-all-files-from-ftp-in-python/

import ftplib
import time

kwargs = {'server': 'ftp.ncbi.nlm.nih.gov',
          'user': 'anonymous', 'password': credentials['ncbi']['email'],
          'path': '/genomes/Homo_sapiens/RNA/', 'fname': None, 
          'destination': paths['data_dir']}

def download_ftp(**kwargs):
    server = kwargs['server']
    user = kwargs['user']
    password = kwargs['password']
    path = kwargs['path']
    fname = kwargs['fname']
    destination = kwargs['destination']
    interval = 0.05
    
    ftp = ftplib.FTP(server)
    ftp.login(user, password)
    
    try:
        ftp.cwd(path)       
        os.chdir(destination)
    except OSError:     
        pass
    except ftplib.error_perm:
        print("Error: could not change to " + path)
        sys.exit("Ending Application")
    
    filelist=ftp.nlst()

    for file in filelist:
        if file == fname:
            time.sleep(interval)
            try:
                ftp.retrbinary("RETR " + file, open(os.path.join(destination, file),"wb").write)
                print("Downloaded: " + file)
            except:
                print("Error: File could not be downloaded: " + file)
        else:
            continue
In [46]:
kwargs = {'server': 'ftp.ncbi.nlm.nih.gov', 'path': '/genomes/Homo_sapiens/RNA/', 'password': credentials['ncbi']['email'], 
          'user': 'anonymous', 'destination': paths['data_dir'], 'fname': 'rna.fa.gz'}

download_ftp(**kwargs)
In [52]:
gi_hist = gi.histories.get_current_history()['id']
local_transcriptome = os.path.join(paths['data_dir'], 'rna.fa.gz')

gi.tools.upload_file(local_transcriptome, gi_hist)
Out[52]:
{'outputs': [{'misc_blurb': None,
   'peek': '<table cellspacing="0" cellpadding="3"></table>',
   'update_time': '2018-07-11T02:04:45.942460',
   'data_type': 'galaxy.datatypes.data.Data',
   'tags': [],
   'deleted': False,
   'history_id': '6129cc23f6415d9a',
   'visible': True,
   'genome_build': '?',
   'create_time': '2018-07-11T02:04:45.820710',
   'hid': 21,
   'file_size': 0,
   'file_ext': 'auto',
   'id': 'bbd44e69cb8906b5dc29473996eff3d9',
   'misc_info': None,
   'hda_ldda': 'hda',
   'history_content_type': 'dataset',
   'name': 'rna.fa.gz',
   'uuid': '39241dfa-d6bb-4e1b-bc2f-676ec79243cc',
   'state': 'queued',
   'model_class': 'HistoryDatasetAssociation',
   'metadata_dbkey': '?',
   'output_name': 'output0',
   'purged': False}],
 'implicit_collections': [],
 'jobs': [{'tool_id': 'upload1',
   'update_time': '2018-07-11T02:04:46.199759',
   'exit_code': None,
   'state': 'new',
   'create_time': '2018-07-11T02:04:45.984266',
   'model_class': 'Job',
   'id': 'bbd44e69cb8906b5f0eacd50dd2bef20'}],
 'output_collections': []}

Galaxy QC

Ok! I now have the Illumina BodyMap Thyroid datasets and the GRCh38.p12 transcriptome uploaded to my Galaxy Instance's history. I'm now ready to run the reads through FastQC and determine their quality. Now, the workflow class of BioBlend is pretty complex, and not really meant to be written by hand. Rather, to expedite the process, I'll invoke the job through the browser interface to Galaxy, then export the Workflow/history, and copy it here to allow it to be run programmatically, as well.

FastQC and MultiQC are available by default to Galaxy Main; I'll use both to check the Thyroid reads.

Unfortunately, the jobs take a significant amount of time to run on the Galaxy Main server, so the results will not be available immediately. Instead, the status of the job can be monitored with gi.histories.get_status():

In [59]:
gi.histories.get_status(gi_hist)
Out[59]:
{'state': 'running',
 'state_details': {'paused': 0,
  'ok': 8,
  'failed_metadata': 0,
  'upload': 0,
  'discarded': 0,
  'running': 2,
  'setting_metadata': 0,
  'error': 0,
  'new': 0,
  'queued': 0,
  'empty': 0},
 'percent_complete': 80.0}

It should be possible, from this, to write some kind of function to monitor the status of your Galaxy Instance's History's status, and initiate subsequent jobs automatically when previous steps complete, but that functionality would only be of use while I'm going through the process of building up the workflow. Once a viable workflow has been executed and exported, it should be possible to intiate it with a couple of steps, passing the workflow script to the Galaxy Main server, and trying gi.workflows.invoke_workflow().

Anyways, once the first part is complete, I'll download the workflow.

In [67]:
thyroid_fastqc_workflow_id = gi.workflows.get_workflows()[0]['id']
thyroid_fastqc_workflow = gi.workflows.export_workflow_dict(thyroid_fastqc_workflow_id)
In [78]:
type(thyroid_fastqc_workflow)
thyroid_fastqc_workflow.keys()
steps = []
for k, v in thyroid_fastqc_workflow['steps'].items():
    steps.append(v)
for step in steps:
    print(step['name'])
Input dataset
Input dataset
Input dataset
Input dataset
FastQC
FastQC
FastQC
In [79]:
steps[-3:]
Out[79]:
[{'tool_id': 'toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72',
  'tool_version': '0.72',
  'outputs': [{'type': 'html', 'name': 'html_file'},
   {'type': 'txt', 'name': 'text_file'}],
  'workflow_outputs': [],
  'input_connections': {'input_file': {'output_name': 'output', 'id': 0}},
  'tool_state': '{"__page__": null, "limits": "null", "input_file": "null", "__rerun_remap_job_id__": null, "contaminants": "null", "chromInfo": "\\"/cvmfs/data.galaxyproject.org/managed/len/ucsc/?.len\\""}',
  'id': 4,
  'tool_shed_repository': {'owner': 'devteam',
   'changeset_revision': 'c15237684a01',
   'name': 'fastqc',
   'tool_shed': 'toolshed.g2.bx.psu.edu'},
  'uuid': 'b2a10d65-1824-47d7-be4e-418e4e8e65dd',
  'errors': None,
  'name': 'FastQC',
  'post_job_actions': {},
  'label': None,
  'inputs': [],
  'position': {'top': 10, 'left': 230},
  'annotation': '',
  'content_id': 'toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72',
  'type': 'tool'},
 {'tool_id': 'toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72',
  'tool_version': '0.72',
  'outputs': [{'type': 'html', 'name': 'html_file'},
   {'type': 'txt', 'name': 'text_file'}],
  'workflow_outputs': [],
  'input_connections': {'input_file': {'output_name': 'output', 'id': 1}},
  'tool_state': '{"__page__": null, "limits": "null", "input_file": "null", "__rerun_remap_job_id__": null, "contaminants": "null", "chromInfo": "\\"/cvmfs/data.galaxyproject.org/managed/len/ucsc/?.len\\""}',
  'id': 5,
  'tool_shed_repository': {'owner': 'devteam',
   'changeset_revision': 'c15237684a01',
   'name': 'fastqc',
   'tool_shed': 'toolshed.g2.bx.psu.edu'},
  'uuid': '3b267a6a-751a-4dfb-84c4-2149715851d7',
  'errors': None,
  'name': 'FastQC',
  'post_job_actions': {},
  'label': None,
  'inputs': [],
  'position': {'top': 130, 'left': 230},
  'annotation': '',
  'content_id': 'toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72',
  'type': 'tool'},
 {'tool_id': 'toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72',
  'tool_version': '0.72',
  'outputs': [{'type': 'html', 'name': 'html_file'},
   {'type': 'txt', 'name': 'text_file'}],
  'workflow_outputs': [],
  'input_connections': {'input_file': {'output_name': 'output', 'id': 2}},
  'tool_state': '{"__page__": null, "limits": "null", "input_file": "null", "__rerun_remap_job_id__": null, "contaminants": "null", "chromInfo": "\\"/cvmfs/data.galaxyproject.org/managed/len/ucsc/?.len\\""}',
  'id': 6,
  'tool_shed_repository': {'owner': 'devteam',
   'changeset_revision': 'c15237684a01',
   'name': 'fastqc',
   'tool_shed': 'toolshed.g2.bx.psu.edu'},
  'uuid': '01d88c98-cb06-47ca-8ec9-4fbff26cd01a',
  'errors': None,
  'name': 'FastQC',
  'post_job_actions': {},
  'label': None,
  'inputs': [],
  'position': {'top': 250, 'left': 230},
  'annotation': '',
  'content_id': 'toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72',
  'type': 'tool'}]

That should give a general impression of the number of specific inputs required to specify a simple task like running FastQC on three files. It's not something I want to reproduce manually. Instead, I'll continue with the workflow via the GUI, then export the whole thing at the end.

In the meantime, let's show some plots here, to summarize the thyroid reads' quality:

In [90]:
thyroid_multiqc_plots_dir = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap\galaxy\thyroid_multiqc_1_files'

thyroid_pngs = []
for file in os.listdir(thyroid_multiqc_plots_dir):
    if file[:7] == 'fastqc_':
        print(file)
        thyroid_pngs.append(os.path.join(thyroid_multiqc_plots_dir, file))
fastqc_adapter_content_plot.png
fastqc_per_base_n_content_plot.png
fastqc_per_base_sequence_quality_plot.png
fastqc_per_sequence_gc_content_plot.png
fastqc_per_sequence_quality_scores_plot.png
fastqc_sequence_duplication_levels_plot.png
In [108]:
from IPython.display import Image

Image(thyroid_pngs[2])
Out[108]:
In [109]:
Image(thyroid_pngs[4])
Out[109]:

So, surprisingly, the data don't look that bad for the thyroid reads, at least when taken as a whole. Contrast this to the lousy results seen when just looking at the first 10,000 reads from any dataset when I was verifying their quality on my local machine. Perhaps the first few spots included in any run are more likely to be skewed to a lower quality when compared to the entirety of the experiment?

Since my main objective with this analysis is to quantification of gene expression across these tissues, I'll also want to make sure that we do see some transcripts more than once. We might not expect the same portion of a transcript to show up in many reads, and therefore sequence duplication levels might not be the most important metric to establish this now, but it still would be good to confirm that there is some amount of sequence duplication present in these reads:

In [110]:
Image(thyroid_pngs[5])
Out[110]:

In any case, even though they look mostly fine, it's still clear that the first dozen or so bases do run into the problem listed before about heavy bias in the identity of different nucleotides (unfortunately, MultiQC says that this type of report plot alone doesn't export as a png, so I had to use printscreen, and thus the quality takes a hit):

In [111]:
per_seq_qual = os.path.join(thyroid_multiqc_plots_dir, 'per_base_seq_grab.png')
Image(per_seq_qual)
Out[111]:

This means that I'll definitely need to trim. Unfortunately, the BBTools suite doesn't appear to be available on Galaxy by default:

In [31]:
gi.jobs.get_jobs()
tools = gi.tools.get_tools()
len(tools)
Out[31]:
1626
In [62]:
tools[0]
Out[62]:
{'panel_section_name': 'NGS: Mothur',
 'description': 'Merge SFF files',
 'labels': ['updated'],
 'edam_operations': [],
 'form_style': 'regular',
 'edam_topics': [],
 'panel_section_id': 'ngs:_mothur',
 'version': '1.36.1.0',
 'link': '/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmothur_merge_sfffiles%2Fmothur_merge_sfffiles%2F1.36.1.0',
 'target': 'galaxy_main',
 'min_width': -1,
 'model_class': 'Tool',
 'id': 'toolshed.g2.bx.psu.edu/repos/iuc/mothur_merge_sfffiles/mothur_merge_sfffiles/1.36.1.0',
 'tool_shed_repository': {'owner': 'iuc',
  'changeset_revision': 'e7f1da3e0339',
  'name': 'mothur_merge_sfffiles',
  'tool_shed': 'toolshed.g2.bx.psu.edu'},
 'name': 'Merge.sfffiles'}
In [32]:
tool_descriptions = {}

for tool in tools:
    tool_descriptions[tool['name']] = tool['description']

tool_descriptions_S = pd.Series(list(tool_descriptions.values()))
tool_descriptions_S.unique().shape
Out[32]:
(776,)
In [127]:
for x in tool_descriptions.keys():
    if 'map' in x:
        print(x)
Filter mapped reads
heatmap2
Heatmap.sim
Heatmap.bin
plotHeatmap
Combine mapped faux paired-end reads
tmap
Heatmap 

Rather, their main tool for editing reads is Trimmomatic.

Actually, there are alternative, simpler tools that can be run in series, which are based on the FastX-Toolkit. The author of that package even includes docs on how to use it in the context of Galaxy. These simpler steps include trimming by length, trimming by quality, filtering by read quality, clipping adapter sequences, removing 'sequencing artifacts', etc. I haven't used this suite of tools before, but am intrigued by having these steps split out explicitly, so I'll try it with the thyroid reads:

The options available in the FastX-Tool for trimming by length include specifying a first base to keep; it may be too aggressive, but I'm going to chop the first dozen bases from every read off. Then I'll use more gentle filtering by read quality (90% of bases within a read must have PHRED>20), clipping adapters, and trying that 'sequencing artifact' removal step.

Actually, after scheduling these steps, I found that the adapter-clipping step had errored out; I wasn't sure how to choose the options for that tool, and apparently failed to supply an input adapter sequence to target. The tool's window has a dropdown menu that is titled 'Source', and has only two options: 'Standard (select from the list below)', and 'Enter custom sequence'. There's an empty field below it that's definitely meant to receive input related to this choice, and it's titled 'Choose adapter'. Rather than feeding it a single adapter sequence, I had gambled that the first option, 'Standard' would pass a default library of adapter sequences, and that these would likely include common Illumina mRNA-seq kit, and further that the '(select from the list below)' part of that line was perhaps misplaced, referring instead to the dropdown window containing that line, since there doesn't seem to be another such 'list' anywhere in the tool's form page that would correspond to such a note.

Instead, the tool had apparently expected me to input a single adapter sequence, because the tool stalled with an error (I'm having difficulty, one day later, tracking down the exact string returned by that error, because I'm unsure of the particular BioBlend command syntax to retrieve it). The man page for the CLI-based version of FastX-Tools indicates that the clipper tool expects a single custom sequence be input at runtime:

[-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).

Anyways, since subsequent steps were dependent upon the output of that tool in order to execute, the rest of the quality control steps stalled, and as a result the run didn't finish overnight, but needs to be re-run. That may be an advantage of substituting a tool like Trimmomatic, which executes as a single pipeline, rather than the separate steps of FastX-Tools. I had checked the Galaxy Main server from my phone after quitting last night, and had seen that the jobs had stalled, but I couldn't supply an adapter easily in that format, so I had tried just deleting the stalled step and the subsequent steps, then calling just the follow-up steps, disregarding the adapter clipping. But I was unable to figure out how to resume the workflow over my phone at the time. In fact, when I checked it again today on my PC I still couldn't find a way to resume it. The workflow steps were highlighted in light blue, saying something like 'this job is paused; select "Resume Paused Jobs" under "History Options" to restart it', but following those instructions had no effect on the workflow. I tried looking for guidance on that particular issue, but it sounds like this may be an acknowledged bug, and my workaround was the most expedient. Too bad.

Anyways, I'm skipping this adapter clipping step for now; I'm unsure whether it may be a legitimate concern... Actually, I just reviewed the MultiQC report on the raw thyroid reads, and it does include a plot called 'Adapter Content', which indicates that all three Thyroid read files passed whatever threshold FastQC applies for that criterion, but the plot does indicate that adapter content increases towards the end of reads up to a maximum of 0.17% of content towards the 3' end of the SE run (around base 64). That's not too bad, but if I can get rid of it it's worth adding one more step to the pipeline. The trace indicates that the identity of the adapter it has detected is the Illumina Universal Adapter. I consulted Google to get the sequence corresponding to this adapter, and found this page. Although it lists the Universal Adapter under two different contexts (the TruSeq and TruSight kits), they appear to have the exact same sequence. So I entered that into the FastX-Tools adapter clip step and tacked that step on to the end of the workflows to modify the Thyroid read files.

Those steps are taking a while to run, especially for the SE run (up to 65bp in length, compared to 50bp each for the PE files, although the input file sizes were about the same). If I check the job history, either through the BioBlend API or the Browser interface, it returns a 'Created' timestamp for the dataset, in UTC, and Google says that 'Coordinated Universal Time is 7 hours ahead of Pacific Time'. This jibes with my memory in the case of the first trim steps, which I initiated around 8:30PM on 20180710; the timestamp says 'Wed 11 Jul 2018 03:24:49 AM (UTC)'. The earliest QC steps today (20180711) were started at ~5:45PM, and are still running two hours later.

In [59]:
gi_hist = gi.histories.get_current_history()['id']
gi.histories.get_status(gi_hist)
# gi.jobs.get_state()
Out[59]:
{'state': 'running',
 'state_details': {'paused': 0,
  'ok': 29,
  'failed_metadata': 0,
  'upload': 0,
  'discarded': 0,
  'running': 1,
  'setting_metadata': 0,
  'error': 0,
  'new': 3,
  'queued': 0,
  'empty': 0},
 'percent_complete': 87.87878787878788}

Post-Trim Quality

Ok, let's check the plots from MultiQC after trimming:

In [13]:
if o_s == 'Darwin':
    thyroid_multiqc_plots_dir_2 = '/Users/drew/data/Bio/galaxy/thyroid_multiqc_2_files'

thyroid_pngs_2 = []
for file in os.listdir(thyroid_multiqc_plots_dir_2):
    if file[:7] == 'fastqc_':
        print(file)
        thyroid_pngs_2.append(os.path.join(thyroid_multiqc_plots_dir_2, file))
fastqc_per_base_seq.png
fastqc_per_base_sequence_quality_plot.png
fastqc_per_sequence_quality_scores_plot.png
fastqc_sequence_duplication_levels_plot.png
In [15]:
from IPython.display import Image

Image(thyroid_pngs_2[1])
Out[15]:
In [19]:
Image(thyroid_pngs_2[2])
Out[19]:
In [20]:
Image(thyroid_pngs_2[0])
Out[20]:
In [21]:
Image(thyroid_pngs_2[-1])
Out[21]:

Those all look pretty good. It's interesting that the representation of sequences that have duplicates actually seems to have increased (i.e., chech out how the peak over the '>10' bin has gotten closer to the '25%' tickmark on the y axis). Presumably, this occurred at the quality filtering step; trimming the reads shouldn't have increased this score. It must be that some of the unique reads were of particularly low quality. This step now actually fails MultiQC's automated cutoff for quality, but I imagine that, if anything, it will make the quantification step more interesting.

Mapping Reads to Transcripts

As Conesa et al. 2016 points out, when mapping against a transcriptome (as opposed to a genome), you can use an ungapped mapper, and Bowtie is the most popular choice, historically. If you're aligning against the genome, you have to take into account the mismatch between the intron-containing nature of genomic sequences and the intron-absent (and alternatively spliced) nature of the cDNA from which the reads were amplified; in this case the most popular choice historically has been TopHat.

But furthermore, when mapping reads which represent such short segments of transcriptomes that possess splice variants, the result will be that many, many of the reads will map to multiple different transcripts in the transcriptome equally well, which complicates quantification of gene expression from RNAseq data.

A most straightforward, naive approach would simply be to admit of all multi-mapped reads, and report how many reads aligned to every single transcript or gene in the reference sequence. But even with this most naive approach, another complication comes from the fact that genes come with different lengths, and read libraries from which counts are drawn vary in size. The difficulty with the latter is obvious (total counts will be higher whenever you have more reads), but the problem with the former is simply that longer genes will, for any given copy number, represent a larger number of total nucleotides in the input cDNA library, and therefore be biased towards being represented in more reads than shorter genes. To make comparisons across different sequencing experiments possible, practitioners have introduced a normalizing parameter, 'reads per kilobase of exon model per million reads', or RPKM. Further, they state that two derivatives of this measure, 'FPKM (fragments per kilobase of exon model per million mapped reads), a within-sample normalized transcript expression measure analogous to RPKs, and TPM (transcripts per million) are the most frequently reported RNA-seq gene expression values'. Finally, and obviously, there are biases in the amplification steps involved in any sequencing protocol, especially in GC content, but also in repetitive sequences or other structural considerations of the template DNA.

As to the proper tools that were built with these issues in mind, I'll just copy verbatim the Conesa_2016 suggestion:

'Algorithms that quantify expression from transcriptome mappings include RSEM (RNA-Seq by Expectation Maximization) [40], eXpress [41], Sailfish [35] and kallisto [42] among others. These methods allocate multi-mapping reads among transcript and output within-sample normalized values corrected for sequencing biases [35, 41, 43]. Additionally, the RSEM algorithm uses an expectation maximization approach that returns TPM values [40]. NURD [44] provides an efficient way of estimating transcript expression from SE reads with a low memory and computing cost.'

Let's check for which of these are already available to the Galaxy Instance, to see what options I'll have going forward:

In [58]:
looking_for = ['RSEM', 'eXpress', 'RNA-Seq by Expectation Maximization', 'Sailfish', 'Kallisto', 'Salmon', 'NURD', 'Bowtie', 'Tophat', 'Cufflinks']

for x in set(tool_descriptions.keys()):
    for looked in looking_for:
        if looked in x:
            print(x)
Convert FASTA to Bowtie color space Index
Convert FASTA to Bowtie base space Index
Tophat
Bowtie2
Cufflinks
Kallisto pseudo
Kallisto quant
Map with Bowtie for Illumina
Salmon
Tophat2
Sailfish

I should note that Galaxy also separates the (substantial number of; 1,626 as of 20180711) tools already compiled and available to run immediately versus a stable of some others (751 in number as of 20180711) that are available to install prior to running, in the 'toolshed':

In [61]:
toolshed = {}

for tool in gi.toolShed.get_repositories():
    toolshed[tool['name']] = tool['id']

print(len(toolshed))
751
In [63]:
for x in set(toolshed.keys()):
    for looked in looking_for:
        if looked in x:
            print(x)
In [64]:
list(toolshed.keys())[:20]
Out[64]:
['allele_counts',
 'analyze_covariates',
 'annotation_profiler',
 'bam_to_sam',
 'bam_to_scidx',
 'bamleftalign',
 'bamtools',
 'bamtools_filter',
 'bamtools_split',
 'basecoverage',
 'bcftools_call',
 'bcftools_view',
 'bedtools',
 'best_regression_subsets',
 'biom_add_metadata',
 'biom_convert',
 'blast_datatypes',
 'blat_coverage_report',
 'blat_mapping',
 'bowtie2']

But apparently, it turns out that none of those missing from the tools list are present in the toolshed.

Additionally, Galaxy itself has a pretty good overview of the ins and outs of the RNAseq pipeline, and they say that RPKM, FPKM, and TPM are not suitable for comparison of expression levels across samples; they're relative measures. This seems to conflict from my reading of Conesa2016. The Galaxy article links to this blog post for further clarification.

All right, I want to set up a first mapping/quantification run, and think I'll try Sailfish first, since it claims to be pretty fast. The Galaxy interface for that tool asks first if you want to use a standard transcriptome/index, or supply your own. I'm doing the latter, with the NCBI version of GRCh38.p12 rna.fa.gz that I uploaded. Next, they ask for 'The size of the k-mer on which the index is built'; the default in Galaxy's interface is 21, but the Sailfish FAQ says that their CLI-based default k-mer count is 31. They also say that it doesn't appear to be that sensitive a parameter, however, so I'll stick with 21 for the first pass. The NCBI Annotation Report doesn't appear to supply any guidance on whether this was a relevant parameter when building the transcriptome and, if so, what value they used.

They also ask, per run, whether the library is SE or PE reads; they don't have an option addressing both, so I guess I'll have to set up separate runs for the two types. In this history, the fully-trimmed SE file is dataset #53: clip on data 50. The paired end runs are datasets #51 & 52.

But a little further down, there's a parameter called 'File containing a mapping of transcripts to genes', and says it's looking for a GTF file, and that it's for calculating 'Gene-level abundance estimations', clearly addressing the point I brought up earlier about the distinction between whether you're counting the number of reads that map to a specific transcript or to the concept of a gene more generally in a genome. The annotation report doesn't mention this, but if you go to the FTP's README file, searching for 'gtf' brings up just 4 hits, in a section titled 'org_transcript.gff.gz and zoo_transcript.gff.gz files', saying that 'These files provide cDNA-to-Genomic, or spliced sequence alignments. These files include same-species and cross-species alignments, respectively. Alignments are generated via the Splign alignment tool'. It claims that those files are located in the 'MAPVIEW' directory, but there is no such directory located in the top level of the FTP site. If I navigate to 'ARCHIVE', 'ANNOTATION_RELEASE.108', there is such a dir, but the files within are all empty (they have a 'size' of 0B). It's the same thing for other ANNOTATION_RELEASEs.

The closest thing I can find appears to be in a subdir called 'GFF', which just contains one file: 'ref_GRCh38.p12_top_level.gff3'. Its contents look vaguely like a list mapping individual genese to genomic coordinates:

NC_000001.11    Curated Genomic pseudogene      131068  134836  .       +       .       ID=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;Name=CICP27;description=capicua transc
riptional repressor pseudogene 27;gbkey=Gene;gene=CICP27;gene_biotype=pseudogene;pseudo=true
NC_000001.11    Curated Genomic exon    131068  132927  .       +       .       ID=id100;Parent=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;gbkey=exon;gene=CICP27
NC_000001.11    Curated Genomic exon    132987  133322  .       +       .       ID=id101;Parent=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;gbkey=exon;gene=CICP27
NC_000001.11    Curated Genomic exon    133733  134058  .       +       .       ID=id102;Parent=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;gbkey=exon;gene=CICP27
NC_000001.11    Curated Genomic exon    134378  134836  .       +       .       ID=id103;Parent=gene11;Dbxref=GeneID:100420257,HGNC:HGNC:48835;gbkey=exon;gene=CICP27
NC_000001.11    BestRefSeq      gene    134773  140566  .       -       .       ID=gene12;Dbxref=GeneID:729737;Name=LOC729737;description=uncharacterized LOC729737;gbkey=Gene;g
ene=LOC729737;gene_biotype=lncRNA
NC_000001.11    BestRefSeq      lnc_RNA 134773  140566  .       -       .       ID=rna24;Parent=gene12;Dbxref=GeneID:729737,Genbank:NR_039983.2;Name=NR_039983.2;gbkey=ncRNA;gen
e=LOC729737;product=uncharacterized LOC729737;transcript_id=NR_039983.2
NC_000001.11    BestRefSeq      exon    140075  140566  .       -       .       ID=id104;Parent=rna24;Dbxref=GeneID:729737,Genbank:NR_039983.2;gbkey=ncRNA;gene=LOC729737;produc
t=uncharacterized LOC729737;transcript_id=NR_039983.2

But it's not easy to tell whether they'll fit the bill. Since, however, as I said I can see no other candidate, I'll try passing this file to the program.

In [24]:
gff = os.path.join(paths['data_dir'], 'ref_GRCh38.p12_top_level.gff3.gz')

gi.tools.upload_file(gff, gi_hist)
Out[24]:
{'outputs': [{'misc_blurb': None,
   'peek': '<table cellspacing="0" cellpadding="3"></table>',
   'update_time': '2018-07-12T06:40:29.390439',
   'data_type': 'galaxy.datatypes.data.Data',
   'tags': [],
   'deleted': False,
   'history_id': '6129cc23f6415d9a',
   'visible': True,
   'genome_build': '?',
   'create_time': '2018-07-12T06:40:29.299134',
   'hid': 65,
   'file_size': 0,
   'file_ext': 'auto',
   'id': 'bbd44e69cb8906b5f01d1fe56f52a610',
   'misc_info': None,
   'hda_ldda': 'hda',
   'history_content_type': 'dataset',
   'name': 'ref_GRCh38.p12_top_level.gff3.gz',
   'uuid': 'c59e3ef1-b9f3-464d-8f53-d7c9a630d8a0',
   'state': 'queued',
   'model_class': 'HistoryDatasetAssociation',
   'metadata_dbkey': '?',
   'output_name': 'output0',
   'purged': False}],
 'implicit_collections': [],
 'jobs': [{'tool_id': 'upload1',
   'update_time': '2018-07-12T06:40:29.568165',
   'exit_code': None,
   'state': 'new',
   'create_time': '2018-07-12T06:40:29.426047',
   'model_class': 'Job',
   'id': 'bbd44e69cb8906b5a1c05a20e188d795'}],
 'output_collections': []}

After this, there were a number of arguments that I left in their default mode. Then there's a field called 'Caculate Effective Lengths', whose default value is 200. I couldn't figure out what this meant, until I read the next field, called 'Standard deviation', whose description made me think it's asking for read lengths. But the Sailfish docs don't mention this parameter, nor the follow-on options. I think I'll leave all of these at their default values and just run for now.

That scheduled; we'll see what the output is. I also want to start a paired-end run with Sailfish, and specified that 'Mate pair 1' is dataset '#51 Clip on data 48', that 'Mate pair 2' is dataset '#52 Clip on data 49', and furthermore that 'Relative orientation of reads within a pair' is 'Mates are oriented toward each other (I = inward)'. I passed the same ref_GRCh38.p12_top_level.gff3.gz file as the 'File containing a mapping of transcripts to genes', left other params as their defaults, and scheduled the run.

While I was setting up the PE run, however, the jobs associated with the SE run turned red in the history, indicating that they had failed. The same issue eventually occurred with the PE run. When I clicked on the jobs, the error message was:

This job was terminated because it used more memory than it was allocated.
Please click the bug icon to report this problem if you need help.

When I check the info associated with the run, however, it shows all of the options I had set, and none of them refer to the amount of memory to allocate.

I googled the error, and found this post, which makes it pretty clear that the dataset is too large for the memory allocated to all free users of the Galaxy Main server. Essentially, the recommendation from the Galaxy admins is to move to using the Galaxy Cloudman server, with AWS support. I'm not sure I'm ready to do this right now, maybe the better option is to split/downsize the read files, and try again? Unfortunately, I don't see any tools available to do this on the Galaxy Main page;

In [36]:
x = 'x'
blah = 'blah_x'
if 'z' or 'y' in blah:
    print(blah)
blah_x
In [23]:
gi_hist = gi.histories.get_current_history()['id']
gi.histories.get_status(gi_hist)
# gi.jobs.get_state()
Out[23]:
{'state': 'ok',
 'state_details': {'paused': 0,
  'ok': 36,
  'failed_metadata': 0,
  'upload': 0,
  'discarded': 0,
  'running': 0,
  'setting_metadata': 0,
  'error': 0,
  'new': 0,
  'queued': 0,
  'empty': 0},
 'percent_complete': 100.0}
In [27]:
# gi.workflows.export_workflow_dict()
errors = gi.histories.get_current_history()['state_ids']['error']
jobs = gi.jobs.get_jobs()
running_jobs = []
errored_jobs = []
for d in jobs:
    if d['state'] == 'running':
        running_jobs.append(d['id'])
    elif d['state'] == 'error':
        errored_jobs.append(d['id'])

# gi.jobs.show_job(errored_jobs[0])
gi.jobs.show_job(running_jobs[0])
Out[27]:
{'tool_id': 'toolshed.g2.bx.psu.edu/repos/devteam/fastx_clipper/cshl_fastx_clipper/1.0.2',
 'update_time': '2018-07-12T01:44:25.104971',
 'inputs': {'input': {'src': 'hda',
   'id': 'bbd44e69cb8906b523cfc7319a5bef0b',
   'uuid': '95f4de37-26b4-4640-b66d-2a7245e512dd'}},
 'outputs': {'output': {'src': 'hda',
   'id': 'bbd44e69cb8906b5d64dd9998898c141',
   'uuid': '1975e162-1f39-4a72-9165-3dea3f892041'}},
 'exit_code': None,
 'state': 'running',
 'create_time': '2018-07-12T01:20:33.125302',
 'params': {'minlength': '"15"',
  'keepdelta': '"0"',
  'clip_source': '{"clip_source_list": "user", "clip_sequence": "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT", "__current_case__": 0}',
  'KEEP_N': '""',
  'dbkey': '"?"',
  'DISCARD_OPTIONS': '""',
  'chromInfo': '"/cvmfs/data.galaxyproject.org/managed/len/ucsc/?.len"'},
 'model_class': 'Job',
 'id': 'bbd44e69cb8906b5f11a773f7328268a'}
In [75]:
dir(gi.genomes)
for gen in gi.genomes.get_genomes():
    if 'GRCh' in gen[0]:
        print(gen)
        
# gi.genomes.show_genome('hg38Patch11')
['GRCh37.p10 Sep. 2012 (GRCh37.p10/hg19Patch10) (hg19Patch10)', 'hg19Patch10']
['tarInv Dec. 2013 (GRCh38Tar/tarIhg38) (tarIhg38)', 'tarIhg38']
['GRCh38.p6 Dec. 2015 (hg38Patch6)', 'hg38Patch6']
['GRCh38.p7 Mar. 2016 (hg38Patch7)', 'hg38Patch7']
['GRCh38.p5 Sep. 2015 (hg38Patch5)', 'hg38Patch5']
['GRCh38.p2 Dec. 2014 (hg38Patch2)', 'hg38Patch2']
['GRCh38.p3 Apr. 2015 (hg38Patch3)', 'hg38Patch3']
['Human Dec. 2013 (GRCh38/hg38) (hg38)', 'hg38']
['GRCh38.p9 Sep. 2016 (hg38Patch9)', 'hg38Patch9']
['hg19Haplotypes Feb. 2009 (GRCh37/hg19Haps) (hg19Haps)', 'hg19Haps']
['GRCh38.p11 Jun. 2017 (hg38Patch11)', 'hg38Patch11']
['Human Feb. 2009 (GRCh37/hg19) (hg19)', 'hg19']
['GRCh37.p2 Aug. 2009 (GRCh37.p2/hg19Patch2) (hg19Patch2)', 'hg19Patch2']
['GRCh37.p5 Jun. 2011 (GRCh37.p5/hg19Patch5) (hg19Patch5)', 'hg19Patch5']
['GRCh37.p9 Jul. 2012 (GRCh37.p9/hg19Patch9) (hg19Patch9)', 'hg19Patch9']
In [83]:
# gi.genomes.install_genome(source='URL', url_dbkey='ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/Gnomon_mRNA.fsa.gz')
# gi.libraries.create_library('DCM_BMap')
# gi_hist = gi.histories.get_current_history()['id']
# gi.histories.upload_dataset_from_library(gi_hist, 'GRCh38.p11 Jun. 2017 (hg38Patch11)')
gi.genomes.show_genome('GRCh38.p11 Jun. 2017 (hg38Patch11)')
In [88]:
gi_hist
Out[88]:
'6129cc23f6415d9a'
In [97]:
dir(gi.libraries)
for gen in gi.histories.show_history(gi_hist, contents=True):
    print('{}: {}'.format(gen['id'], gen['name']))
# gi.histories.show_history(gi_hist)
bbd44e69cb8906b57cfcf7d5d9606c5f: UCSC Main on Human: ncbiRefSeq (genome)
bbd44e69cb8906b5068614053c14485c: UCSC Main on Human: ncbiRefSeq (genome)
bbd44e69cb8906b5a230f5586070575b: UCSC Main on Human: ncbiRefSeq (genome)
bbd44e69cb8906b5f287bdb518f62c47: UCSC Main on Human: ncbiRefSeq (genome)
bbd44e69cb8906b545bd634a51dab2be: UCSC Main on Human: ncbiRefSeq (genome)
bbd44e69cb8906b5910d39fe793aa0ea: ERR030890 (fastq-dump)
bbd44e69cb8906b545b68d86e45e2c36: ERR030903_thyroid.fastq
f43556be37cf63dd: MultiQC on data 3: Stats
bbd44e69cb8906b56bb6df56f295396e: MultiQC on data 3: Webpage
bbd44e69cb8906b5a9e7fc5449ebbfc2: FastQC on data 3: Webpage
bbd44e69cb8906b534c658209d404136: FastQC on data 3: RawData
1a32d6c7952439ce: MultiQC on data 7 and data 6: Stats
bbd44e69cb8906b5c2997982ba5d30d5: MultiQC on data 7 and data 6: Webpage
9e9ecbf09116be1b: MultiQC on data 7: Stats
bbd44e69cb8906b52e1dd04ee33a9241: MultiQC on data 7: Webpage
bbd44e69cb8906b52cfda6a3a71d15d1: fastqc
bbd44e69cb8906b5cb363ce4b9ed5323: general_stats
bbd44e69cb8906b577a7c40027416a6f: sources
In [109]:
gnomon = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/Gnomon_mRNA.fsa.gz'
# gi.tools.put_url(gnomon, gi_hist)
gi.tools.url()
Out[109]:
{'outputs': [{'misc_blurb': None,
   'peek': '<table cellspacing="0" cellpadding="3"></table>',
   'update_time': '2018-07-09T03:22:35.130424',
   'data_type': 'galaxy.datatypes.data.Data',
   'tags': [],
   'deleted': False,
   'history_id': '6129cc23f6415d9a',
   'visible': True,
   'genome_build': '?',
   'create_time': '2018-07-09T03:22:34.966147',
   'hid': 15,
   'file_size': 0,
   'file_ext': 'auto',
   'id': 'bbd44e69cb8906b5acb279d80243e3e6',
   'misc_info': None,
   'hda_ldda': 'hda',
   'history_content_type': 'dataset',
   'name': 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/RNA/Gnomon_mRNA.fsa.gz',
   'uuid': 'c6227de9-ff2d-41ea-9981-353e3a22478d',
   'state': 'queued',
   'model_class': 'HistoryDatasetAssociation',
   'metadata_dbkey': '?',
   'output_name': 'output0',
   'purged': False}],
 'implicit_collections': [],
 'jobs': [{'tool_id': 'upload1',
   'update_time': '2018-07-09T03:22:35.322596',
   'exit_code': None,
   'state': 'new',
   'create_time': '2018-07-09T03:22:35.209132',
   'model_class': 'Job',
   'id': 'bbd44e69cb8906b5a3a72b3948fd5124'}],
 'output_collections': []}
In [128]:
kwargs = {'blah': '1', 'blah2': 2}
def test(blah=None, blah2=None):
    if kwargs:
        for k, v in kwargs.items():
            print(k, v)
            k = v
        print(blah)
            
test(**kwargs)
blah 1
blah2 2
1
In [117]:
local_transcriptome = os.path.join(paths['data_dir'], 'Gnomon_mRNA.fsa.gz')

gi.tools.upload_file(local_transcriptome, gi_hist)
Out[117]:
{'outputs': [{'misc_blurb': None,
   'peek': '<table cellspacing="0" cellpadding="3"></table>',
   'update_time': '2018-07-09T04:10:40.989347',
   'data_type': 'galaxy.datatypes.data.Data',
   'tags': [],
   'deleted': False,
   'history_id': '6129cc23f6415d9a',
   'visible': True,
   'genome_build': '?',
   'create_time': '2018-07-09T04:10:40.859726',
   'hid': 16,
   'file_size': 0,
   'file_ext': 'auto',
   'id': 'bbd44e69cb8906b5fe7d86149cb364a3',
   'misc_info': None,
   'hda_ldda': 'hda',
   'history_content_type': 'dataset',
   'name': 'Gnomon_mRNA.fsa.gz',
   'uuid': '9e63ed95-0715-4809-9d6a-109d64f3aa28',
   'state': 'queued',
   'model_class': 'HistoryDatasetAssociation',
   'metadata_dbkey': '?',
   'output_name': 'output0',
   'purged': False}],
 'implicit_collections': [],
 'jobs': [{'tool_id': 'upload1',
   'update_time': '2018-07-09T04:10:41.220770',
   'exit_code': None,
   'state': 'new',
   'create_time': '2018-07-09T04:10:41.047103',
   'model_class': 'Job',
   'id': 'bbd44e69cb8906b5d1602b418332454a'}],
 'output_collections': []}
In [94]:
gi.datasets.show_dataset('bbd44e69cb8906b545b68d86e45e2c36')
Out[94]:
{'accessible': True,
 'type_id': 'dataset-bbd44e69cb8906b545b68d86e45e2c36',
 'resubmitted': False,
 'create_time': '2018-07-08T03:37:07.012666',
 'creating_job': 'fbc2d1e0eccf013b',
 'dataset_id': '7774608c247bd9c7',
 'file_size': 17144074226,
 'file_ext': 'fastqsanger',
 'id': 'bbd44e69cb8906b545b68d86e45e2c36',
 'misc_info': 'uploaded fastq file',
 'hda_ldda': 'hda',
 'download_url': '/api/histories/6129cc23f6415d9a/contents/bbd44e69cb8906b545b68d86e45e2c36/display',
 'state': 'ok',
 'display_types': [],
 'display_apps': [],
 'metadata_dbkey': '?',
 'type': 'file',
 'misc_blurb': '16.0 Gb',
 'peek': '<table cellspacing="0" cellpadding="3"><tr><td>@ERR030903.1 HWI-BRUNOP16X_0001:8:1:4264:1030#0/1</td></tr><tr><td>NNTCGAGTGAGATATCAGGTTCTATAGAGCGCCTTGAAAAAAGTAATACCTAACCATAAAGTTAAAATCTATGTA</td></tr><tr><td>+</td></tr><tr><td>###########################################################################</td></tr><tr><td>@ERR030903.2 HWI-BRUNOP16X_0001:8:1:6442:1029#0/1</td></tr><tr><td>NNATAGTAATAATCCCCATCCTCCATATATCCAAACAACAAAGCATAACATTTCGCCCACTAAGCCAATCACTTT</td></tr></table>',
 'update_time': '2018-07-08T03:59:33.967559',
 'data_type': 'galaxy.datatypes.sequence.FastqSanger',
 'tags': [],
 'deleted': False,
 'history_id': '6129cc23f6415d9a',
 'meta_files': [],
 'genome_build': '?',
 'metadata_sequences': None,
 'hid': 3,
 'model_class': 'HistoryDatasetAssociation',
 'metadata_data_lines': None,
 'annotation': None,
 'uuid': None,
 'history_content_type': 'dataset',
 'name': 'ERR030903_thyroid.fastq',
 'extension': 'fastqsanger',
 'visible': True,
 'url': '/api/histories/6129cc23f6415d9a/contents/bbd44e69cb8906b545b68d86e45e2c36',
 'visualizations': [{'description': None,
   'embeddable': False,
   'href': '/plugins/interactive_environments/jupyter/show',
   'entry_point': {'type': 'mako', 'attr': {}, 'file': 'jupyter.mako'},
   'groups': None,
   'logo': None,
   'specs': None,
   'target': 'galaxy_main',
   'name': 'jupyter',
   'title': None,
   'settings': None,
   'html': 'Jupyter'}],
 'rerunnable': False,
 'purged': False,
 'api_type': 'file'}
In [62]:
%%time

one_layer_deep = {}
for k, v in folder_ids.items():
    types = []
    for id_ in v:
        output = gi.folders.show_folder(id_, contents=True)
        for d in output['folder_contents']:
            types.append(d['type'])
    one_layer_deep[k] = set(types)
# folder_ids.items()
Wall time: 6min 19s
In [63]:
one_layer_deep
Out[63]:
{'JG Tutorials': {'file', 'folder'},
 'Test Trinity': {'file'},
 'Windshield splatter': {'file'},
 'Evolutionary Trajectories in a Phage': {'file'},
 'Codon Usage Frequencies': {'file'},
 'Sample NGS Datasets': {'file'},
 '1000 Genomes': {'file', 'folder'},
 'mtProjectDemo': {'file'},
 'guru_1000GP': {'folder'},
 'Irish whole genome': {'file'},
 'He-2010': set(),
 'puc18': set(),
 'Kasthuri': set(),
 'mtProject': {'file', 'folder'},
 'AC-exome': set(),
 'the Y': {'file'},
 'Erythroid Epigenetic Landscape': {'file', 'folder'},
 'Heteroplasmy': {'file'},
 'Putative SNP phenotypes': {'file', 'folder'},
 'Genome Diversity': {'file', 'folder'},
 'Variant Detection Demo': {'file', 'folder'},
 'Illumina iDEA Datasets (sub-sampled)': {'file'},
 'Whale Drop': {'file'},
 'Coleman': {'file', 'folder'},
 'GCAT': {'folder'},
 'Illumina BodyMap 2.0': {'file'},
 'Ribosome Profiling Data': {'file'},
 'GATK': {'file', 'folder'},
 'Omics Data': {'file', 'folder'},
 'iGenomes': {'file', 'folder'},
 'XPrize': {'folder'},
 'CloudMap': {'file', 'folder'},
 'Denisovan sequences': {'file', 'folder'},
 'ChIP-Seq Mouse Example': {'file'},
 'Pancreatic Cancer Cell Lines': {'file'},
 'Bushman': {'file', 'folder'},
 'Charts Example Data': {'file'},
 'Demonstration Datasets': {'file', 'folder'},
 'Wolbachia Example': {'file', 'folder'},
 'Mt Study 2014': {'file'},
 'dbSNP': {'file'},
 'Tutorials': {'file', 'folder'},
 'prueba': {'folder'}}
In [64]:
gi.folders.show_folder(*folder_ids['Illumina BodyMap 2.0'], contents=True)
Out[64]:
{'folder_contents': [{'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030872_1_thyroid.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'f6a3bd151a3a8748',
   'date_uploaded': '2012-01-16T21:47:58.770405'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030872_2_thyroid.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'ae0b7240c36ccba5',
   'date_uploaded': '2012-01-16T21:47:56.969001'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030873_1_testes.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '7d56c4f017dcd757',
   'date_uploaded': '2012-01-16T21:47:51.490993'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030873_2_testes.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'fde7e9ca25602ee1',
   'date_uploaded': '2012-01-16T21:47:46.860265'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030874_1_ovary.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '839fe5d1e0e9c165',
   'date_uploaded': '2012-01-16T21:47:52.034962'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030874_2_ovary.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '519fd9d1242cc1e0',
   'date_uploaded': '2012-01-16T21:47:49.572931'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030875_1_white_blood_cells.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.4 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '3dcae1a1c9673b8f',
   'date_uploaded': '2012-01-16T21:47:49.834471'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030875_2_white_blood_cells.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.4 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4547b18930bcca99',
   'date_uploaded': '2012-01-16T21:47:48.819835'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030876_1_skeletal_muscle.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '399161604c092ae0',
   'date_uploaded': '2012-01-16T21:47:59.071412'},
  {'update_time': '2012-01-16 09:48 PM',
   'name': 'ERR030876_2_skeletal_muscle.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:48 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '3980dbe474dddda2',
   'date_uploaded': '2012-01-16T21:48:00.271105'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030877_1_prostate.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'b76c0dda2ec23632',
   'date_uploaded': '2012-01-16T21:47:57.250436'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030877_2_prostate.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '31732550962f350e',
   'date_uploaded': '2012-01-16T21:47:46.295875'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030878_1_lymph_node.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'a2eaae02cff788e2',
   'date_uploaded': '2012-01-16T21:47:52.583248'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030878_2_lymph_node.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'd926c582e8b7b2c6',
   'date_uploaded': '2012-01-16T21:47:57.849570'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030879_1_lung.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.1 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'bd99481f96cf6512',
   'date_uploaded': '2012-01-16T21:47:54.553624'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030879_2_lung.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.1 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '3200103516b45c0f',
   'date_uploaded': '2012-01-16T21:47:49.315662'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030880_1_adipose.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.8 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'dadcba58fd7e1d61',
   'date_uploaded': '2012-01-16T21:47:55.418221'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030880_2_adipose.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.8 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '1f6b28ee654eb129',
   'date_uploaded': '2012-01-16T21:47:51.748967'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030881_1_adrenal.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '24a5812c11b4c1f0',
   'date_uploaded': '2012-01-16T21:47:55.135371'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030881_2_adrenal.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'a1abd22dea5eaaef',
   'date_uploaded': '2012-01-16T21:47:52.298635'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030882_1_brain.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c101d05bd0f525d1',
   'date_uploaded': '2012-01-16T21:47:58.160521'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030882_2_brain.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '5a32854d8339314a',
   'date_uploaded': '2012-01-16T21:47:50.099875'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030883_1_breast.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4665dd4441aa475b',
   'date_uploaded': '2012-01-16T21:47:54.276754'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030883_2_breast.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '11.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'd039e2757ad0af3c',
   'date_uploaded': '2012-01-16T21:47:57.552267'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030884_1_colon.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '380b536cca82c11d',
   'date_uploaded': '2012-01-16T21:47:52.866195'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030884_2_colon.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c5ef7fa2585b8706',
   'date_uploaded': '2012-01-16T21:47:49.064696'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030885_1_kidney.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '96fbf82c4429bc02',
   'date_uploaded': '2012-01-16T21:47:56.003574'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030885_2_kidney.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '7a94aebc4fa0dcdc',
   'date_uploaded': '2012-01-16T21:47:54.830905'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030886_1_heart.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '40d2284efbf7ef0a',
   'date_uploaded': '2012-01-16T21:47:53.963799'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030886_2_heart.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c561ecaddc7fc85d',
   'date_uploaded': '2012-01-16T21:47:47.123906'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030887_1_liver.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '8f26a0dbd6e9333d',
   'date_uploaded': '2012-01-16T21:47:51.221231'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030887_2_liver.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '31dea5361ec9421c',
   'date_uploaded': '2012-01-16T21:47:55.710015'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030888_adipose.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'bfcfb3010da4b6a8',
   'date_uploaded': '2012-01-16T21:47:50.653450'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030889_adrenal.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '15ec091f521ad959',
   'date_uploaded': '2012-01-16T21:47:53.144164'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030890_brain.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '12.8 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '96bbf3272771839b',
   'date_uploaded': '2012-01-16T21:47:50.361833'},
  {'update_time': '2012-01-16 09:48 PM',
   'name': 'ERR030891_breast.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:48 PM',
   'is_private': False,
   'file_size': '15.4 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c857a19d677604e6',
   'date_uploaded': '2012-01-16T21:48:00.570425'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030892_colon.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.0 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'dcf0e672e1333e91',
   'date_uploaded': '2012-01-16T21:47:46.598656'},
  {'update_time': '2012-01-16 09:48 PM',
   'name': 'ERR030893_kidney.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.9 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '46df3e684412a294',
   'date_uploaded': '2012-01-16T21:47:59.683890'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030894_heart.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4fa309035f471214',
   'date_uploaded': '2012-01-16T21:47:58.455682'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030895_liver.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '15.4 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4e4d02922fcb3933',
   'date_uploaded': '2012-01-16T21:47:56.670498'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030896_lung.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.2 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '4df6282897c485c7',
   'date_uploaded': '2012-01-16T21:47:47.389174'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030897_lymph_node.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '432354bd93c21d5f',
   'date_uploaded': '2012-01-16T21:47:48.322755'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030898_prostate.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.6 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'c7595b938cd160aa',
   'date_uploaded': '2012-01-16T21:47:59.391636'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030899_skeletal_muscle.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'fecae70edc3b1987',
   'date_uploaded': '2012-01-16T21:47:48.571176'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030900_white_blood_cells.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.5 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '767530b6e7057080',
   'date_uploaded': '2012-01-16T21:47:53.424273'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030901_ovary.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.1 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': 'f348cffd461a84b2',
   'date_uploaded': '2012-01-16T21:47:56.290547'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030902_testes.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.3 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '8f5304a8adf17a54',
   'date_uploaded': '2012-01-16T21:47:48.029373'},
  {'update_time': '2012-01-16 09:47 PM',
   'name': 'ERR030903_thyroid.fastq',
   'deleted': False,
   'state': 'failed_metadata',
   'is_unrestricted': True,
   'can_manage': False,
   'create_time': '2012-01-16 09:47 PM',
   'is_private': False,
   'file_size': '16.0 GB',
   'file_ext': 'fastqsanger',
   'type': 'file',
   'id': '1f711917a9c0c715',
   'date_uploaded': '2012-01-16T21:47:50.945340'}],
 'metadata': {'parent_library_id': '8bb3ab7690e13de8',
  'can_modify_folder': False,
  'folder_description': '',
  'can_add_library_item': False,
  'full_path': [['Fd795f6d3e169879a', 'Illumina BodyMap 2.0']],
  'folder_name': 'Illumina BodyMap 2.0'}}
In [65]:
gi.datasets.download_dataset('f6a3bd151a3a8748', )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-65-39a2563afd69> in <module>()
----> 1 gi.datasets.download_dataset('f6a3bd151a3a8748', )

C:\Python\lib\site-packages\bioblend\galaxy\datasets\__init__.py in download_dataset(self, dataset_id, file_path, use_default_filename, wait_for_completion, maxwait)
     78                  Otherwise returns nothing.
     79         """
---> 80         dataset = self._block_until_dataset_terminal(dataset_id, maxwait=maxwait)
     81         if not dataset['state'] == 'ok':
     82             raise DatasetStateException("Dataset state is not 'ok'. Dataset id: %s, current state: %s" % (dataset_id, dataset['state']))

C:\Python\lib\site-packages\bioblend\galaxy\datasets\__init__.py in _block_until_dataset_terminal(self, dataset_id, maxwait, interval)
    149         while True:
    150             dataset = self.show_dataset(dataset_id)
--> 151             state = dataset['state']
    152             if state in terminal_states:
    153                 return dataset

TypeError: string indices must be integers
In [ ]:
dir(gi.ftpfiles)
# gi.ftpfiles.get_ftp_files()
print(gi.libraries.show_library(libs[2][1]))
print(libs[2])

# gi.datasets.show_dataset('f6a3bd151a3a8748', hda_ldda='ldda')
dir(gi.datasets)
In [ ]:
folder_ids['Illumina BodyMap 2.0']
In [ ]:
gi.folders.show_folder('Ff1e7bb3cd7a0f387', contents=True)
# gi.folders.get_folders('Ff1e7bb3cd7a0f387')
# dir(gi.folders)
In [ ]:
gi.libraries.get_folders('00b73c7735c2f000')
# gi.libraries.get_folders('00b73c7735c2f000', folder_id='Fcad2bae1b5da4dc0')
In [ ]:
stats.describe(np.array(lens))
lens_S = pd.Series(lens)
print('{}\n\n{}'.format(lens_S.describe(), lens_S.value_counts()))
In [ ]:
print(gi.folders.show_folder('Fd795f6d3e169879a', contents=False))
print(gi.genomes.show_genome('hg38Patch11').keys())
# print(gi.genomes.show_genome('hg38Patch11')['chrom_info'])
In [ ]:
# gi.folders.show_folder()
print(gi.libraries.get_folders('8bb3ab7690e13de8'))
print(gi.folders.show_folder('Fd795f6d3e169879a'))

So the question is, how to access the individual files within the folder above, and create a workflow where they get processed...

In [ ]:
for d in dir(gi):
    if d[0] != '_':
        print(d)
In [ ]:
dir(gi.datasets)
gi.datasets.show_dataset()
In [ ]:
gi.workflows.get_workflows()
dir(gi.workflows)
In [ ]:
for d in gi.libraries.get_libraries():
    print('{}: {} {:<30} {}'.format(d['id'], d['public'], d['name'], d['description']))
In [75]:
bmap_lib = gi.libraries.get_libraries('8bb3ab7690e13de8')
bmap_lib
Out[75]:
[{'can_user_add': False,
  'description': 'RNA-seq data for the Illumina BodyMap 2.0 project',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2012-01-16T19:38:50.562525',
  'root_folder_id': 'Fd795f6d3e169879a',
  'model_class': 'Library',
  'id': '8bb3ab7690e13de8',
  'name': 'Illumina BodyMap 2.0'}]
In [81]:
dir(gi)
dir(gi.libraries)
gi.libraries.get_libraries()
# gi.libraries.show_dataset('8bb3ab7690e13de8', 'a24fb72e059884e2')
Out[81]:
[{'can_user_add': False,
  'description': 'Datasets for Tutorials taught by Jeremy Goecks',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '4 months ago',
  'public': True,
  'synopsis': '',
  'create_time': '2018-02-19T23:17:16.048181',
  'root_folder_id': 'Fcad2bae1b5da4dc0',
  'model_class': 'Library',
  'id': '00b73c7735c2f000',
  'name': 'JG Tutorials'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 months ago',
  'public': True,
  'synopsis': '',
  'create_time': '2017-12-07T16:13:38.662209',
  'root_folder_id': 'F65063aa2c697f935',
  'model_class': 'Library',
  'id': '71da45b5cac57b18',
  'name': 'Test Trinity'},
 {'can_user_add': False,
  'description': 'Metagenomic analysis (454)',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '8 years ago',
  'public': True,
  'synopsis': None,
  'create_time': '2009-08-21T16:29:04.012743',
  'root_folder_id': 'Fed01147b4aa1e8de',
  'model_class': 'Library',
  'id': '175812cd7caaf439',
  'name': 'Windshield splatter'},
 {'can_user_add': False,
  'description': 'Experimental evolution (Illumina)',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '8 years ago',
  'public': True,
  'synopsis': None,
  'create_time': '2009-08-27T13:19:38.050546',
  'root_folder_id': 'F175d7d5ddeeb1d43',
  'model_class': 'Library',
  'id': 'e4de88c47079d971',
  'name': 'Evolutionary Trajectories in a Phage'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '8 years ago',
  'public': True,
  'synopsis': None,
  'create_time': '2009-09-18T16:11:33.871776',
  'root_folder_id': 'F5141774efbefaf3e',
  'model_class': 'Library',
  'id': '0bbfef2771c2ff2a',
  'name': 'Codon Usage Frequencies'},
 {'can_user_add': False,
  'description': 'Examples of Illumina, SOLiD, and 454 data',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '8 years ago',
  'public': True,
  'synopsis': 'Use these data to play with Galaxy Tools',
  'create_time': '2010-04-07T13:50:56.453920',
  'root_folder_id': 'F5bee13e9f312df25',
  'model_class': 'Library',
  'id': '0b51cc33f0ee471b',
  'name': 'Sample NGS Datasets'},
 {'can_user_add': False,
  'description': 'Data from the 1000 Genomes Project FTP site',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '8 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-06-07T20:07:51.945667',
  'root_folder_id': 'Ff5109f746542d96d',
  'model_class': 'Library',
  'id': 'def1f3d165efeae2',
  'name': '1000 Genomes'},
 {'can_user_add': False,
  'description': 'Human mtDNA resequencing samples',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '8 years ago',
  'public': True,
  'synopsis': 'Sample data for identification of heteroplasmic sites in single individual',
  'create_time': '2010-06-09T14:42:48.549215',
  'root_folder_id': 'Fded0dccf38d68a18',
  'model_class': 'Library',
  'id': '946f67f475408fac',
  'name': 'mtProjectDemo'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-07-15T18:00:41.870568',
  'root_folder_id': 'Fd922e573b63af05b',
  'model_class': 'Library',
  'id': 'a25bcf89b6148ac3',
  'name': 'guru_1000GP'},
 {'can_user_add': False,
  'description': 'Irish whole genome sequence and analysis',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-08-04T15:16:07.714203',
  'root_folder_id': 'F90191ce08fa7fc29',
  'model_class': 'Library',
  'id': '208627a06ac7ef49',
  'name': 'Irish whole genome'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-08-23T20:31:28.035058',
  'root_folder_id': 'Fbe18635952f0ac68',
  'model_class': 'Library',
  'id': 'e7e502574a48abf0',
  'name': 'He-2010'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-09-01T17:07:57.307452',
  'root_folder_id': 'F1c9c3560eba2590b',
  'model_class': 'Library',
  'id': '5d03b168e8ff4c0d',
  'name': 'puc18'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-10-22T15:32:33.342003',
  'root_folder_id': 'Fe71ab93152910258',
  'model_class': 'Library',
  'id': 'e3fca37c12996169',
  'name': 'Kasthuri'},
 {'can_user_add': False,
  'description': 'Dynamics of mitochondrial heteroplasmy in three families (Illumina 1.6+)',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '8 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-01-07T19:16:00.504010',
  'root_folder_id': 'F96d1ef4698eaa22d',
  'model_class': 'Library',
  'id': 'b9dd912636c41cce',
  'name': 'mtProject'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-12-02T20:58:44.037249',
  'root_folder_id': 'Fe97a7f2d0070b57e',
  'model_class': 'Library',
  'id': '5a50031e389c2768',
  'name': 'AC-exome'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-12-21T18:32:24.880033',
  'root_folder_id': 'Ff83b81a168e3a581',
  'model_class': 'Library',
  'id': '01f54382e0bec4c8',
  'name': 'the Y'},
 {'can_user_add': False,
  'description': 'Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': 'Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration',
  'create_time': '2011-01-27T20:06:23.038207',
  'root_folder_id': 'Feb41682201c6b5df',
  'model_class': 'Library',
  'id': 'b55b4ccbdd06e42c',
  'name': 'Erythroid Epigenetic Landscape'},
 {'can_user_add': False,
  'description': 'Data for Genome Biology 2011 manuscript',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2011-06-07T16:44:44.499830',
  'root_folder_id': 'F7928ac94528c9991',
  'model_class': 'Library',
  'id': '95006dbc688dd3fd',
  'name': 'Heteroplasmy'},
 {'can_user_add': False,
  'description': 'Possible phenotypes and disease associations for SNPs in human builds hg18 and hg19.',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '8 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2010-03-22T18:35:25.078919',
  'root_folder_id': 'F5e8d62e584e19e3b',
  'model_class': 'Library',
  'id': 'e660f0c750f4b341',
  'name': 'Putative SNP phenotypes'},
 {'can_user_add': False,
  'description': 'Nucleotide polymorphisms for several threatened species',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '7 years ago',
  'public': True,
  'synopsis': 'Nucleotide polymorphisms for several threatened species',
  'create_time': '2011-04-15T13:47:26.703272',
  'root_folder_id': 'F8282e20cd374b88f',
  'model_class': 'Library',
  'id': '99b05ea914524955',
  'name': 'Genome Diversity'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2011-07-21T22:15:06.812410',
  'root_folder_id': 'F3d30e89e0e6c4a97',
  'model_class': 'Library',
  'id': '2e44b3e708bb57c4',
  'name': 'Variant Detection Demo'},
 {'can_user_add': False,
  'description': 'Sub-samapled versions of datasets used for the Illumina iDEA challenge',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2011-07-29T17:19:03.802488',
  'root_folder_id': 'F77ce11b6d3c0d73a',
  'model_class': 'Library',
  'id': 'd0c8e88ab05c469f',
  'name': 'Illumina iDEA Datasets (sub-sampled)'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2011-08-09T19:37:47.026198',
  'root_folder_id': 'Ff7361dde000af9b5',
  'model_class': 'Library',
  'id': 'bd305ecbdbed0ec6',
  'name': 'Whale Drop'},
 {'can_user_add': False,
  'description': 'IonPGM',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2011-09-08T13:11:44.056153',
  'root_folder_id': 'F64546582a2b19794',
  'model_class': 'Library',
  'id': 'b29fdfddd1226ffc',
  'name': 'Coleman'},
 {'can_user_add': False,
  'description': 'Consortium',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2011-09-28T13:22:58.092868',
  'root_folder_id': 'F1d458aedb2b00738',
  'model_class': 'Library',
  'id': '79da64dbfcb9e25e',
  'name': 'GCAT'},
 {'can_user_add': False,
  'description': 'RNA-seq data for the Illumina BodyMap 2.0 project',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2012-01-16T19:38:50.562525',
  'root_folder_id': 'Fd795f6d3e169879a',
  'model_class': 'Library',
  'id': '8bb3ab7690e13de8',
  'name': 'Illumina BodyMap 2.0'},
 {'can_user_add': False,
  'description': 'Data from Guo, H., Ingolia, N. T., Weissman, J. S. X Bartel, D. P. Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 466, 835X840 (2010)',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': 'Ribosomal Profiling and RNA-seq data',
  'create_time': '2012-03-15T09:55:19.508300',
  'root_folder_id': 'Faf1b75bdd6c25367',
  'model_class': 'Library',
  'id': 'acd2639f491ec818',
  'name': 'Ribosome Profiling Data'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2012-03-21T13:19:40.609414',
  'root_folder_id': 'F02175c549a96d749',
  'model_class': 'Library',
  'id': 'f9ba60baa2e6ba6d',
  'name': 'GATK'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2012-06-08T09:31:01.626735',
  'root_folder_id': 'Fb778b6da6fbd6bdf',
  'model_class': 'Library',
  'id': 'a041b331babfb419',
  'name': 'Omics Data'},
 {'can_user_add': False,
  'description': 'Selected files from Illumina iGenomes collection',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2012-06-08T18:57:53.847450',
  'root_folder_id': 'F10a969fd4d6ed6fe',
  'model_class': 'Library',
  'id': '4ab3a886a95d362e',
  'name': 'iGenomes'},
 {'can_user_add': False,
  'description': 'Archon Genomics X PRIZE Data',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '5 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2012-11-12T17:00:07.316107',
  'root_folder_id': 'F60acc80e3adeed20',
  'model_class': 'Library',
  'id': '1de29d50c3c44272',
  'name': 'XPrize'},
 {'can_user_add': False,
  'description': 'Contains userguide, reference files, and configuration files for the Cloudmap WGS analysis pipeline',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '5 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2012-09-18T06:00:22.564696',
  'root_folder_id': 'Fe2745d1ec24daabf',
  'model_class': 'Library',
  'id': '7aefd5869a9a28ee',
  'name': 'CloudMap'},
 {'can_user_add': False,
  'description': 'Files from \'A high-coverage genome sequence from an archaic Denisovan Individual" Meyer et al. Science 2012 and basic processed data.',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '5 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2012-09-26T14:40:55.762646',
  'root_folder_id': 'F718be9500ee0f13b',
  'model_class': 'Library',
  'id': 'eed0ab5744bc71cc',
  'name': 'Denisovan sequences'},
 {'can_user_add': False,
  'description': 'Data used in examples that demonstrate analysis of ChIP-Seq data',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '6 years ago',
  'public': True,
  'synopsis': 'Use this data to test out and learn Galaxy\'s ChIP-Seq capabilities.  It has been scaled down to relatively small sizes.<br /><br />These files are from <a href="http://bit.ly/QmD6Nk">this mouse ChIP-SEQ experiment in the ENCODE project</a>. These data were generated and analyzed by the labs of <a href="http://snyderlab.stanford.edu/">Michael Snyder at Stanford University</a> and <a href="http://info.med.yale.edu/bcmm/SMW/SMWhome2.html">Sherman Weissman at Yale University</a>.<br /><br />The original files from ENCODE were too large to use in teaching examples, so they have been reduced to contain only data that corresponds to chromosome 19 (the shortest).<br /><br />These files were created by, well, cheating. We first processed the entire dataset, mapping it to MM9. When went back and extracted from the original datasets only those records that eventually mapped to chromosome 19. ',
  'create_time': '2011-09-19T19:36:33.372536',
  'root_folder_id': 'Fca70286e457e0a27',
  'model_class': 'Library',
  'id': '4e6a692dd7d3508d',
  'name': 'ChIP-Seq Mouse Example'},
 {'can_user_add': False,
  'description': 'Exome + Transcriptome Sequencing',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '4 years ago',
  'public': True,
  'synopsis': 'Targeted Exome and Whole Transcriptome Sequencing of 3 Pancreatic Cancer Cell Lines: MiaPaCa2, PANC1, and HPAC',
  'create_time': '2014-01-24T22:32:47.887204',
  'root_folder_id': 'Fd059b71c7ea66d10',
  'model_class': 'Library',
  'id': 'a24fb72e059884e2',
  'name': 'Pancreatic Cancer Cell Lines'},
 {'can_user_add': False,
  'description': 'Data for two papers about the Khoisan and other populations.',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '8 years ago',
  'public': True,
  'synopsis': 'The dataset called "Kim et al." is analyzed in the paper "Khoisan hunter-gatherers have been the largest population throughout most of modern human demographic history", and contains genotypes for 419,969 SNPs from 1462 worldwide individuals including Khoisan populations. The other data underly the analyses reported in the paper "Complete Khoisan and Bantu genomes from southern Africa" by S. C. Schuster et al., published in the journal Nature, February 18, 2010. Each data set can be downloaded and/or imported into a Galaxy history.',
  'create_time': '2010-01-28T15:04:36.905278',
  'root_folder_id': 'F946f67f475408fac',
  'model_class': 'Library',
  'id': 'db23461568f103ca',
  'name': 'Bushman'},
 {'can_user_add': False,
  'description': '',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '3 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2014-10-31T16:14:19.162798',
  'root_folder_id': 'Fd7aafbcc7ecb00e4',
  'model_class': 'Library',
  'id': '508faf565c0560ef',
  'name': 'Charts Example Data'},
 {'can_user_add': False,
  'description': 'Demonstration datasets collected from various Galaxy tutorials',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '3 years ago',
  'public': True,
  'synopsis': '',
  'create_time': '2015-01-11T18:34:50.603957',
  'root_folder_id': 'F00fdabcadd09fb14',
  'model_class': 'Library',
  'id': '6f124c2ade81ff6d',
  'name': 'Demonstration Datasets'},
 {'can_user_add': False,
  'description': 'Datasets used in small SNP calling example.',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '3 years ago',
  'public': True,
  'synopsis': 'Datasets used in small SNP calling example.',
  'create_time': '2015-01-13T13:45:12.784365',
  'root_folder_id': 'F55aa743125c2465d',
  'model_class': 'Library',
  'id': '1b1913d85ad75c7d',
  'name': 'Wolbachia Example'},
 {'can_user_add': False,
  'description': 'Data from Rebolledo-Jaramillo et al. 2014',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '2 years ago',
  'public': True,
  'synopsis': 'fastq reads (sanger format)',
  'create_time': '2016-01-28T14:15:18.276124',
  'root_folder_id': 'Fbdc8f33b87a051b7',
  'model_class': 'Library',
  'id': 'f11cf06cdcc13cfd',
  'name': 'Mt Study 2014'},
 {'can_user_add': False,
  'description': 'dbSNP releases',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '2 years ago',
  'public': True,
  'synopsis': 'dbSNP releases in VCF format',
  'create_time': '2016-03-22T14:27:03.614672',
  'root_folder_id': 'Fe06719afe7c27e53',
  'model_class': 'Library',
  'id': 'ca858c4c28f5b301',
  'name': 'dbSNP'},
 {'can_user_add': False,
  'description': 'datasets used in Galaxy tutorials',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '2 years ago',
  'public': True,
  'synopsis': 'tutorial data',
  'create_time': '2016-03-23T14:07:18.963166',
  'root_folder_id': 'F60eb9d65ee96ba3b',
  'model_class': 'Library',
  'id': '9b1360d4e4e0c900',
  'name': 'Tutorials'},
 {'can_user_add': False,
  'description': 'esta es la descripcion',
  'deleted': False,
  'can_user_manage': False,
  'can_user_modify': False,
  'create_time_pretty': '2 years ago',
  'public': True,
  'synopsis': 'esta es la sinopsis',
  'create_time': '2016-06-28T16:57:32.447470',
  'root_folder_id': 'F808f1a4b3c43ed0e',
  'model_class': 'Library',
  'id': '2828cd96b51204d5',
  'name': 'prueba'}]
In [89]:
folders = gi.libraries.get_folders('8bb3ab7690e13de8')
for fold in folders:
    print(type(fold))
folders
<class 'dict'>
Out[89]:
[{'url': '/api/libraries/8bb3ab7690e13de8/contents/Fd795f6d3e169879a',
  'type': 'folder',
  'name': '/',
  'id': 'Fd795f6d3e169879a'}]
In [105]:
folder = gi.libraries.show_folder(library_id='8bb3ab7690e13de8', folder_id='Fd795f6d3e169879a')
folder
# dir(gi.folders.show_folder('Fd795f6d3e169879a', contents=True))
# gi.folders.show_folder('Fd795f6d3e169879a', contents=True)
# gi.datasets.show_dataset('bd99481f96cf6512')
# dir(folder)
Out[105]:
{'parent_library_id': '8bb3ab7690e13de8',
 'update_time': '2012-01-16T21:48:00.818444',
 'description': '',
 'deleted': False,
 'item_count': 48,
 'parent_id': None,
 'genome_build': None,
 'model_class': 'LibraryFolder',
 'library_path': [],
 'id': 'Fd795f6d3e169879a',
 'name': 'Illumina BodyMap 2.0'}
In [195]:
folders_dict = gi.folders.show_folder('Fd795f6d3e169879a', contents=True)
In [203]:
for d in folders_dict['folder_contents']:
    if d['name'][-13:] == 'thyroid.fastq':
        print(d)
{'update_time': '2012-01-16 09:47 PM', 'name': 'ERR030872_1_thyroid.fastq', 'deleted': False, 'state': 'failed_metadata', 'is_unrestricted': True, 'can_manage': False, 'create_time': '2012-01-16 09:47 PM', 'is_private': False, 'file_size': '12.5 GB', 'file_ext': 'fastqsanger', 'type': 'file', 'id': 'f6a3bd151a3a8748', 'date_uploaded': '2012-01-16T21:47:58.770405'}
{'update_time': '2012-01-16 09:47 PM', 'name': 'ERR030872_2_thyroid.fastq', 'deleted': False, 'state': 'failed_metadata', 'is_unrestricted': True, 'can_manage': False, 'create_time': '2012-01-16 09:47 PM', 'is_private': False, 'file_size': '12.5 GB', 'file_ext': 'fastqsanger', 'type': 'file', 'id': 'ae0b7240c36ccba5', 'date_uploaded': '2012-01-16T21:47:56.969001'}
{'update_time': '2012-01-16 09:47 PM', 'name': 'ERR030903_thyroid.fastq', 'deleted': False, 'state': 'failed_metadata', 'is_unrestricted': True, 'can_manage': False, 'create_time': '2012-01-16 09:47 PM', 'is_private': False, 'file_size': '16.0 GB', 'file_ext': 'fastqsanger', 'type': 'file', 'id': '1f711917a9c0c715', 'date_uploaded': '2012-01-16T21:47:50.945340'}
In [ ]:
gi.histories.upload_dataset_from_library(, 'f6a3bd151a3a8748')
In [208]:
dir(gi.histories)
gi.histories.get_current_history()
hist_id = gi.histories.get_current_history()['id']
In [210]:
dir(gi.histories)
# gi.histories.show_dataset_collection(hist_id)
gi.histories.upload_dataset_from_library(hist_id, '1f711917a9c0c715')
Out[210]:
{'accessible': True,
 'type_id': 'dataset-bbd44e69cb8906b545b68d86e45e2c36',
 'resubmitted': False,
 'create_time': '2018-07-08T03:37:07.012666',
 'creating_job': 'fbc2d1e0eccf013b',
 'dataset_id': '7774608c247bd9c7',
 'file_size': 17144074226,
 'file_ext': 'fastqsanger',
 'id': 'bbd44e69cb8906b545b68d86e45e2c36',
 'misc_info': 'uploaded fastq file',
 'hda_ldda': 'hda',
 'download_url': '/api/histories/6129cc23f6415d9a/contents/bbd44e69cb8906b545b68d86e45e2c36/display',
 'state': 'ok',
 'display_types': [],
 'display_apps': [],
 'metadata_dbkey': '?',
 'type': 'file',
 'misc_blurb': '16.0 Gb',
 'peek': '<table cellspacing="0" cellpadding="3"><tr><td>@ERR030903.1 HWI-BRUNOP16X_0001:8:1:4264:1030#0/1</td></tr><tr><td>NNTCGAGTGAGATATCAGGTTCTATAGAGCGCCTTGAAAAAAGTAATACCTAACCATAAAGTTAAAATCTATGTA</td></tr><tr><td>+</td></tr><tr><td>###########################################################################</td></tr><tr><td>@ERR030903.2 HWI-BRUNOP16X_0001:8:1:6442:1029#0/1</td></tr><tr><td>NNATAGTAATAATCCCCATCCTCCATATATCCAAACAACAAAGCATAACATTTCGCCCACTAAGCCAATCACTTT</td></tr></table>',
 'update_time': '2018-07-08T03:37:07.164604',
 'data_type': 'galaxy.datatypes.sequence.FastqSanger',
 'tags': [],
 'deleted': False,
 'history_id': '6129cc23f6415d9a',
 'meta_files': [],
 'genome_build': '?',
 'metadata_sequences': None,
 'hid': 3,
 'model_class': 'HistoryDatasetAssociation',
 'metadata_data_lines': None,
 'annotation': None,
 'uuid': None,
 'history_content_type': 'dataset',
 'name': 'ERR030903_thyroid.fastq',
 'extension': 'fastqsanger',
 'visible': True,
 'url': '/api/histories/6129cc23f6415d9a/contents/bbd44e69cb8906b545b68d86e45e2c36',
 'visualizations': [{'description': None,
   'embeddable': False,
   'href': '/plugins/interactive_environments/jupyter/show',
   'entry_point': {'type': 'mako', 'attr': {}, 'file': 'jupyter.mako'},
   'groups': None,
   'logo': None,
   'specs': None,
   'target': 'galaxy_main',
   'name': 'jupyter',
   'title': None,
   'settings': None,
   'html': 'Jupyter'}],
 'rerunnable': False,
 'purged': False,
 'api_type': 'file'}
In [217]:
# folders_dict
gi.datasets.show_dataset('a041b331babfb419')
Out[217]:
'400 Bad Request\nThe server could not comply with the request since\r\nit is either malformed or otherwise incorrect.\r\n\nInvalid dataset id: 199.\n'
In [211]:
hs_chr = 'GRCh38'
dir(gi.genomes)
for gen in gi.genomes.get_genomes():
    if 'hg38' in gen[1]:
        print(gen)
#     if hs_chr in gen[0]:
#         print(gen)
#     print(gen)
#     if 'sapiens' in gen[0]:
#         print(gen)
#     if gen[0][:4] == 'Homo':
#         print(gen)
['tarInv Dec. 2013 (GRCh38Tar/tarIhg38) (tarIhg38)', 'tarIhg38']
['GRCh38.p6 Dec. 2015 (hg38Patch6)', 'hg38Patch6']
['GRCh38.p7 Mar. 2016 (hg38Patch7)', 'hg38Patch7']
['GRCh38.p5 Sep. 2015 (hg38Patch5)', 'hg38Patch5']
['GRCh38.p2 Dec. 2014 (hg38Patch2)', 'hg38Patch2']
['GRCh38.p3 Apr. 2015 (hg38Patch3)', 'hg38Patch3']
['Human Dec. 2013 (GRCh38/hg38) (hg38)', 'hg38']
['GRCh38.p9 Sep. 2016 (hg38Patch9)', 'hg38Patch9']
['GRCh38.p11 Jun. 2017 (hg38Patch11)', 'hg38Patch11']
In [224]:
gi.genomes.get_genomes()
dir(gi.genomes)
gi.genomes.show_genome('hg38Patch11')
gi.genomes.install_genome(ncbi_name='hg38Patch11')
Out[224]:
{'reference': False,
 'next_chroms': False,
 'prev_chroms': False,
 'chrom_info': [{'chrom': 'chr1_KN196472v1_fix', 'len': 186494},
  {'chrom': 'chr1_KN196473v1_fix', 'len': 166200},
  {'chrom': 'chr1_KN196474v1_fix', 'len': 122022},
  {'chrom': 'chr1_KN538360v1_fix', 'len': 460100},
  {'chrom': 'chr1_KN538361v1_fix', 'len': 305542},
  {'chrom': 'chr1_KQ031383v1_fix', 'len': 467143},
  {'chrom': 'chr1_KQ458382v1_alt', 'len': 141019},
  {'chrom': 'chr1_KQ458383v1_alt', 'len': 349938},
  {'chrom': 'chr1_KQ458384v1_alt', 'len': 212205},
  {'chrom': 'chr1_KQ983255v1_alt', 'len': 278659},
  {'chrom': 'chr1_KV880763v1_alt', 'len': 551020},
  {'chrom': 'chr1_KZ208904v1_alt', 'len': 166136},
  {'chrom': 'chr1_KZ208905v1_alt', 'len': 140355},
  {'chrom': 'chr1_KZ208906v1_fix', 'len': 330031},
  {'chrom': 'chr2_KN538362v1_fix', 'len': 208149},
  {'chrom': 'chr2_KN538363v1_fix', 'len': 365499},
  {'chrom': 'chr2_KQ031384v1_fix', 'len': 481245},
  {'chrom': 'chr2_KQ983256v1_alt', 'len': 535088},
  {'chrom': 'chr2_KZ208907v1_alt', 'len': 181658},
  {'chrom': 'chr2_KZ208908v1_alt', 'len': 140361},
  {'chrom': 'chr3_KN196475v1_fix', 'len': 451168},
  {'chrom': 'chr3_KN196476v1_fix', 'len': 305979},
  {'chrom': 'chr3_KN538364v1_fix', 'len': 415308},
  {'chrom': 'chr3_KQ031385v1_fix', 'len': 373699},
  {'chrom': 'chr3_KQ031386v1_fix', 'len': 165718},
  {'chrom': 'chr3_KV766192v1_fix', 'len': 411654},
  {'chrom': 'chr3_KZ208909v1_alt', 'len': 175849},
  {'chrom': 'chr4_KQ090013v1_alt', 'len': 90922},
  {'chrom': 'chr4_KQ090014v1_alt', 'len': 163749},
  {'chrom': 'chr4_KQ090015v1_alt', 'len': 236512},
  {'chrom': 'chr4_KQ983257v1_fix', 'len': 230434},
  {'chrom': 'chr4_KQ983258v1_alt', 'len': 205407},
  {'chrom': 'chr4_KV766193v1_alt', 'len': 420675},
  {'chrom': 'chr5_KN196477v1_alt', 'len': 139087},
  {'chrom': 'chr5_KV575243v1_alt', 'len': 362221},
  {'chrom': 'chr5_KV575244v1_fix', 'len': 673059},
  {'chrom': 'chr5_KZ208910v1_alt', 'len': 135987},
  {'chrom': 'chr6_KN196478v1_fix', 'len': 268330},
  {'chrom': 'chr6_KQ031387v1_fix', 'len': 320750},
  {'chrom': 'chr6_KQ090016v1_fix', 'len': 245716},
  {'chrom': 'chr6_KQ090017v1_alt', 'len': 82315},
  {'chrom': 'chr6_KV766194v1_fix', 'len': 139427},
  {'chrom': 'chr6_KZ208911v1_fix', 'len': 242796},
  {'chrom': 'chr7_KQ031388v1_fix', 'len': 179932},
  {'chrom': 'chr7_KV880764v1_fix', 'len': 142129},
  {'chrom': 'chr7_KV880765v1_fix', 'len': 468267},
  {'chrom': 'chr7_KZ208912v1_fix', 'len': 589656},
  {'chrom': 'chr7_KZ208913v1_alt', 'len': 680662},
  {'chrom': 'chr8_KV880766v1_fix', 'len': 156998},
  {'chrom': 'chr8_KV880767v1_fix', 'len': 265876},
  {'chrom': 'chr8_KZ208914v1_fix', 'len': 165120},
  {'chrom': 'chr8_KZ208915v1_fix', 'len': 6367528},
  {'chrom': 'chr9_KN196479v1_fix', 'len': 330164},
  {'chrom': 'chr9_KQ090018v1_alt', 'len': 163882},
  {'chrom': 'chr9_KQ090019v1_alt', 'len': 134099},
  {'chrom': 'chr10_KN196480v1_fix', 'len': 277797},
  {'chrom': 'chr10_KN538365v1_fix', 'len': 14347},
  {'chrom': 'chr10_KN538366v1_fix', 'len': 85284},
  {'chrom': 'chr10_KN538367v1_fix', 'len': 420164},
  {'chrom': 'chr10_KQ090020v1_alt', 'len': 185507},
  {'chrom': 'chr10_KQ090021v1_fix', 'len': 264545},
  {'chrom': 'chr11_KN196481v1_fix', 'len': 108875},
  {'chrom': 'chr11_KN538368v1_alt', 'len': 203552},
  {'chrom': 'chr11_KQ090022v1_fix', 'len': 181958},
  {'chrom': 'chr11_KQ759759v1_fix', 'len': 196940},
  {'chrom': 'chr11_KV766195v1_fix', 'len': 140877},
  {'chrom': 'chr12_KN196482v1_fix', 'len': 211377},
  {'chrom': 'chr12_KN538369v1_fix', 'len': 541038},
  {'chrom': 'chr12_KN538370v1_fix', 'len': 86533},
  {'chrom': 'chr12_KQ090023v1_alt', 'len': 109323},
  {'chrom': 'chr12_KQ759760v1_fix', 'len': 315610},
  {'chrom': 'chr12_KZ208916v1_fix', 'len': 1046838},
  {'chrom': 'chr12_KZ208917v1_fix', 'len': 64689},
  {'chrom': 'chr12_KZ208918v1_alt', 'len': 174808},
  {'chrom': 'chr13_KN196483v1_fix', 'len': 35455},
  {'chrom': 'chr13_KN538371v1_fix', 'len': 206320},
  {'chrom': 'chr13_KN538372v1_fix', 'len': 356766},
  {'chrom': 'chr13_KN538373v1_fix', 'len': 148762},
  {'chrom': 'chr13_KQ090024v1_alt', 'len': 168146},
  {'chrom': 'chr13_KQ090025v1_alt', 'len': 123480},
  {'chrom': 'chr14_KZ208919v1_alt', 'len': 171798},
  {'chrom': 'chr14_KZ208920v1_fix', 'len': 690932},
  {'chrom': 'chr15_KN538374v1_fix', 'len': 4998962},
  {'chrom': 'chr15_KQ031389v1_alt', 'len': 2365364},
  {'chrom': 'chr16_KQ031390v1_alt', 'len': 169136},
  {'chrom': 'chr16_KQ090026v1_alt', 'len': 59016},
  {'chrom': 'chr16_KQ090027v1_alt', 'len': 267463},
  {'chrom': 'chr16_KV880768v1_fix', 'len': 1927115},
  {'chrom': 'chr16_KZ208921v1_alt', 'len': 78609},
  {'chrom': 'chr17_KV575245v1_fix', 'len': 154723},
  {'chrom': 'chr17_KV766196v1_fix', 'len': 281919},
  {'chrom': 'chr17_KV766197v1_alt', 'len': 246895},
  {'chrom': 'chr17_KV766198v1_alt', 'len': 276292},
  {'chrom': 'chr18_KQ090028v1_fix', 'len': 407387},
  {'chrom': 'chr18_KQ458385v1_alt', 'len': 205101},
  {'chrom': 'chr18_KZ208922v1_fix', 'len': 93070},
  {'chrom': 'chr19_KN196484v1_fix', 'len': 370917},
  {'chrom': 'chr19_KQ458386v1_fix', 'len': 405389},
  {'chrom': 'chr19_KV575246v1_alt', 'len': 163926},
  {'chrom': 'chr19_KV575247v1_alt', 'len': 170206},
  {'chrom': 'chr19_KV575248v1_alt', 'len': 168131},
  {'chrom': 'chr19_KV575249v1_alt', 'len': 293522},
  {'chrom': 'chr19_KV575250v1_alt', 'len': 241058},
  {'chrom': 'chr19_KV575251v1_alt', 'len': 159285},
  {'chrom': 'chr19_KV575252v1_alt', 'len': 178197},
  {'chrom': 'chr19_KV575253v1_alt', 'len': 166713},
  {'chrom': 'chr19_KV575254v1_alt', 'len': 99845},
  {'chrom': 'chr19_KV575255v1_alt', 'len': 161095},
  {'chrom': 'chr19_KV575256v1_alt', 'len': 223118},
  {'chrom': 'chr19_KV575257v1_alt', 'len': 100553},
  {'chrom': 'chr19_KV575258v1_alt', 'len': 156965},
  {'chrom': 'chr19_KV575259v1_alt', 'len': 171263},
  {'chrom': 'chr19_KV575260v1_alt', 'len': 145691},
  {'chrom': 'chr22_KN196485v1_alt', 'len': 156562},
  {'chrom': 'chr22_KN196486v1_alt', 'len': 153027},
  {'chrom': 'chr22_KQ458387v1_alt', 'len': 155930},
  {'chrom': 'chr22_KQ458388v1_alt', 'len': 174749},
  {'chrom': 'chr22_KQ759761v1_alt', 'len': 145162},
  {'chrom': 'chr22_KQ759762v1_fix', 'len': 101037},
  {'chrom': 'chrX_KV766199v1_alt', 'len': 188004},
  {'chrom': 'chrY_KN196487v1_fix', 'len': 101150},
  {'chrom': 'chrY_KZ208923v1_fix', 'len': 48370},
  {'chrom': 'chrY_KZ208924v1_fix', 'len': 209722}],
 'id': 'hg38Patch11',
 'start_index': 0}
In [194]:
dir(gi.folders)
# gi.folders.create_folder(parent_folder_id=None, name='bmap')
dir(gi.histories)
Out[194]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_delete',
 '_get',
 '_get_retry_delay',
 '_max_get_retries',
 '_post',
 '_put',
 'create_dataset_collection',
 'create_history',
 'create_history_tag',
 'delete_dataset',
 'delete_dataset_collection',
 'delete_history',
 'download_dataset',
 'download_history',
 'export_history',
 'get_current_history',
 'get_histories',
 'get_most_recently_used_history',
 'get_retry_delay',
 'get_status',
 'gi',
 'max_get_retries',
 'module',
 'set_get_retry_delay',
 'set_max_get_retries',
 'show_dataset',
 'show_dataset_collection',
 'show_dataset_provenance',
 'show_history',
 'show_matching_datasets',
 'undelete_history',
 'update_dataset',
 'update_dataset_collection',
 'update_history',
 'upload_dataset_from_library',
 'url']
In [185]:
dir(gi.genomes)
gi.workflows.get_workflows()
Out[185]:
[{'name': 'imported: Galaxy RNA-seq analysis',
  'tags': [],
  'deleted': False,
  'latest_workflow_uuid': 'f8ded1e4-2a1a-410d-b8ca-b57d0ce45f19',
  'show_in_tool_panel': False,
  'url': '/api/workflows/528c8b401283ead1',
  'number_of_steps': 47,
  'published': False,
  'owner': 'drew_m',
  'model_class': 'StoredWorkflow',
  'id': '528c8b401283ead1'}]
In [223]:
dir(gi)
dir(gi.datasets)
# dir(gi.datasets.show_dataset)
# gi.datasets.show_dataset('a041b331babfb419')
gi.libraries.show_library('a041b331babfb419')
gi.folders.show_folder('Fb778b6da6fbd6bdf', contents=True)
Out[223]:
{'folder_contents': [{'update_time': '2012-06-29 01:20 PM',
   'can_modify': False,
   'deleted': False,
   'name': 'fastq',
   'can_manage': False,
   'create_time': '2012-06-29 01:20 PM',
   'type': 'folder',
   'id': 'F667769a15bbe3c33',
   'description': 'fastq files from SRA'},
  {'update_time': '2012-07-02 04:24 PM',
   'can_modify': False,
   'deleted': False,
   'can_manage': False,
   'create_time': '2012-07-02 04:24 PM',
   'type': 'folder',
   'id': 'F116cfb5e7bed4bcc',
   'name': 'Pilot Data'},
  {'update_time': '2012-07-02 04:36 PM',
   'name': 'Blood',
   'deleted': False,
   'state': 'ok',
   'is_unrestricted': True,
   'id': '6c8d1f2cd3ef0ce8',
   'can_manage': False,
   'create_time': '2012-07-02 04:36 PM',
   'is_private': False,
   'file_size': '24.4 MB',
   'file_ext': 'bam',
   'type': 'file',
   'message': 'Blood WGS data mapped with BWA against hg19. ChrM data with properly mapped reads only.',
   'date_uploaded': '2012-07-02T16:36:03.168995'},
  {'update_time': '2012-07-02 04:42 PM',
   'name': 'Cheek',
   'deleted': False,
   'state': 'ok',
   'is_unrestricted': True,
   'id': 'dc2254ddc621068b',
   'can_manage': False,
   'create_time': '2012-07-02 04:42 PM',
   'is_private': False,
   'file_size': '7.8 MB',
   'file_ext': 'bam',
   'type': 'file',
   'message': 'Cheek WGS data mapped with BWA against hg19. ChrM data with properly mapped reads only.',
   'date_uploaded': '2012-07-02T16:42:02.601004'},
  {'update_time': '2012-06-08 09:32 AM',
   'name': 'FASTA sequences of Target Regions',
   'deleted': False,
   'state': 'ok',
   'is_unrestricted': True,
   'id': 'bcf1cb32a3b3edb0',
   'can_manage': False,
   'create_time': '2012-06-08 09:32 AM',
   'is_private': False,
   'file_size': '68.8 KB',
   'file_ext': 'fasta',
   'type': 'file',
   'message': 'chrM, XBP1, CDKN2a, and ALEX exon of GNAS1 in FASTA format',
   'date_uploaded': '2012-06-08T09:32:00.585163'},
  {'update_time': '2012-06-29 01:03 PM',
   'name': 'mtdna.interval',
   'deleted': False,
   'state': 'ok',
   'is_unrestricted': True,
   'id': 'eecfd62310be299d',
   'can_manage': False,
   'create_time': '2012-06-29 01:03 PM',
   'is_private': False,
   'file_size': '836 bytes',
   'file_ext': 'interval',
   'type': 'file',
   'message': 'Coordinates of mtDNA genes oer GenBank record',
   'date_uploaded': '2012-06-29T13:03:18.784845'},
  {'update_time': '2012-06-08 09:31 AM',
   'name': 'Target Regions',
   'deleted': False,
   'state': 'ok',
   'is_unrestricted': True,
   'id': 'd893258db7cbffa1',
   'can_manage': False,
   'create_time': '2012-06-08 09:31 AM',
   'is_private': False,
   'file_size': '84 bytes',
   'file_ext': 'interval',
   'type': 'file',
   'message': 'chrM, XBP1, CDKN2a, and ALEX exon of GNAS1 in interval format',
   'date_uploaded': '2012-06-08T09:31:59.524302'},
  {'update_time': '2012-06-08 09:32 AM',
   'name': 'Target Regions',
   'deleted': False,
   'state': 'ok',
   'is_unrestricted': True,
   'id': '21194465bf7af4d3',
   'can_manage': False,
   'create_time': '2012-06-08 09:32 AM',
   'is_private': False,
   'file_size': '84 bytes',
   'file_ext': 'gatk_interval',
   'type': 'file',
   'message': 'chrM, XBP1, CDKN2a, and ALEX exon of GNAS1 in interval format',
   'date_uploaded': '2012-06-08T09:32:00.217624'},
  {'update_time': '2012-07-02 04:26 PM',
   'name': 'Time Point 1 BWA mt only',
   'deleted': False,
   'state': 'ok',
   'is_unrestricted': True,
   'id': 'a187c9321d9a5640',
   'can_manage': False,
   'create_time': '2012-07-02 04:26 PM',
   'is_private': False,
   'file_size': '12.7 MB',
   'file_ext': 'bam',
   'type': 'file',
   'message': 'Mapped against hg19 with BWA. Filtered down to chrM and only properly paired reads are saved.',
   'date_uploaded': '2012-07-02T16:26:48.512217'},
  {'update_time': '2012-07-02 04:26 PM',
   'name': 'Time Point 1 TopHat mt only',
   'deleted': False,
   'state': 'ok',
   'is_unrestricted': True,
   'id': 'f5d1bcace10946b5',
   'can_manage': False,
   'create_time': '2012-07-02 04:26 PM',
   'is_private': False,
   'file_size': '11.7 MB',
   'file_ext': 'bam',
   'type': 'file',
   'message': 'Mapped against hg19 with TopHat 1.4. Filtered to chrM only. Only properly paired reads are saved.',
   'date_uploaded': '2012-07-02T16:26:56.968551'}],
 'metadata': {'parent_library_id': 'a041b331babfb419',
  'can_modify_folder': False,
  'folder_description': '',
  'can_add_library_item': False,
  'full_path': [['Fb778b6da6fbd6bdf', 'Omics Data']],
  'folder_name': 'Omics Data'}}
In [83]:
from bioblend.galaxy.objects import GalaxyInstance
gi = GalaxyInstance('https://usegalaxy.org/', galaxy_api)
# wf = gi.workflows.list()[0]
wf = gi.workflows.list()
hist = gi.histories.list()[0]
inputs = hist.get_datasets()[:2]
input_map = dict(zip(wf.input_labels, inputs))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-83-fcb7435f91ac> in <module>()
      5 hist = gi.histories.list()[0]
      6 inputs = hist.get_datasets()[:2]
----> 7 input_map = dict(zip(wf.input_labels, inputs))

AttributeError: 'list' object has no attribute 'input_labels'
In [85]:
dir(gi.workflows)
gi.workflows.list()
Out[85]:
[]
In [63]:
description1 = 'This will be the main folder, within the root dir, to hold data \
for DCM\'s analysis of the Illumina BodyMap Data on a Galaxy Main instance.'
gi.folders.create_folder(parent_folder_id=None, name='bodymap', description=description1)

description2 = 'This will be the folder to hold the H. sapiens GRCh38 reference genome.'
gi.folders.create_folder(parent_folder_id='bodymap', name='hs_chr', description=description2)

description3 = 'This folder will hold the raw Illumina BodyMap reads from the NCBI SRA.'
gi.folders.create_folder(parent_folder_id='bodymap', name='bmap_raw', description=description3)
---------------------------------------------------------------------------
ConnectionError                           Traceback (most recent call last)
<ipython-input-63-37052bfc08c1> in <module>()
      1 description1 = 'This will be the main folder, within the root dir, to hold data for DCM\'s analysis of the Illumina BodyMap Data on a Galaxy Main instance.'
----> 2 gi.folders.create_folder(parent_folder_id=None, name='bodymap', description=description1)
      3 
      4 description2 = 'This will be the folder to hold the H. sapiens GRCh38 reference genome.'
      5 gi.folders.create_folder(parent_folder_id='bodymap', name='hs_chr', description=description2)

C:\Python\lib\site-packages\bioblend\galaxy\folders\__init__.py in create_folder(self, parent_folder_id, name, description)
     30         if description:
     31             payload['description'] = description
---> 32         return self._post(payload=payload, id=parent_folder_id)
     33 
     34     def show_folder(self, folder_id, contents=False):

C:\Python\lib\site-packages\bioblend\galaxy\client.py in _post(self, payload, id, deleted, contents, url, files_attached)
    150                                     contents=contents)
    151         return self.gi.make_post_request(url, payload=payload,
--> 152                                          files_attached=files_attached)
    153 
    154     def _put(self, payload, id=None, url=None, params=None):

C:\Python\lib\site-packages\bioblend\galaxyclient.py in make_post_request(self, url, payload, params, files_attached)
    144         # @see self.body for HTTP response body
    145         raise ConnectionError("Unexpected HTTP status code: %s" % r.status_code,
--> 146                               body=r.text, status_code=r.status_code)
    147 
    148     def make_delete_request(self, url, payload=None, params=None):

ConnectionError: Unexpected HTTP status code: 500: {"err_msg": "Uncaught exception in exposed API method:", "err_code": 0}
In [61]:
dir(gi)
Out[61]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_key',
 '_make_url',
 'base_url',
 'config',
 'datasets',
 'datatypes',
 'default_params',
 'folders',
 'forms',
 'ftpfiles',
 'genomes',
 'get_retry_delay',
 'groups',
 'histories',
 'jobs',
 'json_headers',
 'key',
 'libraries',
 'make_delete_request',
 'make_get_request',
 'make_post_request',
 'make_put_request',
 'max_get_attempts',
 'quotas',
 'roles',
 'timeout',
 'toolShed',
 'tool_data',
 'tools',
 'url',
 'users',
 'verify',
 'visual',
 'workflows']
In [51]:
complex_ = r'C:\Users\DMacKellar\Documents\Data\Bio\Bmap\split3\map_2018-07-02T20_43_53\NC_000020_11_ERR030871.sam'
print(os.path.basename(complex_))
NC_000020_11_ERR030871.sam

Side Note: Setting Up Ubuntu on Windows

Incidentally, since this is the first real use of the Ubuntu bash on Windows for me, I found that it didn't have tab autocompletion activated, so I followed these instructions, modifying C:\Users\DMacKellar\AppData\Local\lxss\root.bashrc with Notepad++ to uncomment the last 3 lines. Now it works, but it beeps annoyingly whenever the automcompletion doesn't execute perfectly. I'll follow these instructions to attempt to remedy that. I used vim from within the Ubuntu terminal to uncomment line 21 in /etc/inputrc. ...Ok, that got it.

A couple more points:

  • When it comes to copying terminal output to the Windows clipboard (for the all-important copying error tracebacks and pasting them into Google), I had to use this, clicking the icon in the upper-left of the Bash terminal window, selecting Properties -> Options -> Quick Edit Mode. Hopefully this doesn't reset upon re-opening Bash (it doesn't appear to right now).

  • When it comes to pasting from the Windows clipboard TO the terminal, Ctrl+V doesn't work, but right-clicking the trackpad works.

I still was getting an error when attempting to launch FastQC; something about Java needing X windows support:

Exception in thread "main" java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.
        at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:204)
        at java.awt.Window.<init>(Window.java:536)
        at java.awt.Frame.<init>(Frame.java:420)
        at java.awt.Frame.<init>(Frame.java:385)
        at javax.swing.JFrame.<init>(JFrame.java:189)
        at uk.ac.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:63)
        at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:332)

This site offered help for that, saying to use Xming. That program sounded familiar from my Harvard days, so I checked and found that it's still installed on the PC. After activating it, attempting to launch FastQC still yielded the same error, but I then entered into the terminal:

export DISPLAY=:0

and tried again and now it launches!

But... more complications. FastQC doesn't have access to the Windows directory structure, where I left the fastq files. I need to figure out where in the Windows dir system the linux directory structure is being mounted.

Ah, ok: the structure within the Ubuntu shell mirrors that of Windows, from the (within-Ubuntu) location of '/mnt/c/'; i.e., within that dir, Ubuntu can access anything on the C:\ drive within Windows. As this post says, the actual location of that within the Windows system is at 'C:\Users\DMacKellar\AppData\Local\lxss\rootfs', but it's not recommended to make drastic changes from within the Windows end. In other words, the best way to access the fastq files is to point the Ubuntu terminal at '/mnt/c/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq'.

When I try to open Galaxy2-[adrenal_1.fastq].fastqsanger, however, the X-window for Fastqc says 'Read 48000 sequences (95%)', and doesn't seem to change over time. TaskManager seems to indicate that it isn't taking up much resources, so it may be stuck, or it may need more time, or it may be that it won't work loading such a file from within a window. The Ubuntu terminal is outputting warnings and errors as a result of the open X window (whether from just loading the program, or due to processes launched upon attempting to open the fastq file, I'm not certain), saying that it can't find various files in '/etc/fastqc/Configuration/'.

I wonder if the command line would be better. It sounds like to run command line mode, you just call the program and then follow immediately with the fastq file you want to analyze. I killed the process so I could try again with this approach:

fastqc /mnt/c/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/Galaxy2-[adrenal_1.fastq].fastqsanger

Ok, so that at least told me:

Failed to process file Galaxy2-[adrenal_1.fastq].fastqsanger
java.lang.IllegalArgumentException: No key called gc_sequence:ignore in the config data

Googling that, the first hit says that another user fixed this error by copying the contents of some other Fastqc config dir over to /etc/fastqc dir, but I can't find the configuration location to which they're referring. Doing a 'find' command for fastqc returns:

./usr/bin/fastqc
./usr/share/applications/fastqc.desktop
./usr/share/doc/fastqc
./usr/share/fastqc
./usr/share/icons/hicolor/32x32/apps/fastqc_icon.png
./usr/share/java/fastqc-0.11.4.jar
./usr/share/java/fastqc.jar
./usr/share/man/man1/fastqc.1.gz
./var/cache/apt/archives/fastqc_0.11.4+dfsg-3_all.deb
./var/lib/dpkg/info/fastqc.list
./var/lib/dpkg/info/fastqc.md5sums

The only likely dirs here are /usr/bin/fastqc/, /usr/share/doc/fastqc/, and /usr/share/fastqc/, and none of them contain a Config file. Digging a bit deeper, it sounds like other users are surprised about the installation they get when they go through 'sudo apt-get install fastqc'. Apparently most such installs aren't expected to go to '/etc/'. The docs say it's a Java program, and so should have a minimal footprint, and run out of a simple zip. I'll try:

(within Ubuntu terminal):
cd ~
sudo apt-get remove fastqc
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip fastqc_v0.11.5.zip
chmod +x ~/FastQC/fastqc
cd /mnt/c/Users/DMacKellar/Documents/Python/BioPython/Galaxy_rnaseq/
~/FastQC/fastqc Galaxy2-[adrenal_1.fastq].fastqsanger

Ok, now it works lickety-split. Even launching the GUI in an X-window, navigating to the dir with the fastq files in it, and executing it analyzes the data very rapidly. The sequences look pretty poor, and definitely in need of trimming: